[Core][gcs] Fix replica actor zombie process issue after GCS restart#63764
Conversation
Problem: After GCS restart, the local_raylet_address_ for ALIVE actors is not restored, which prevents GCS from sending KillLocalActorRequest to clean up zombie actor processes. Without this address, GCS logs 'Actor has not been assigned a lease' and the worker process remains alive. Solution: In GcsActorManager::Initialize(), restore local raylet address for ALIVE actors on ALIVE nodes. The address is retrieved from persisted node info in Redis. Tests: - TestInitializeRestoresLocalRayletAddressForAliveActors - TestInitializeDoesNotRestoreLocalRayletAddressForDeadNodes Files modified: - src/ray/gcs/actor/gcs_actor_manager.cc (+21 lines) - src/ray/gcs/actor/tests/gcs_actor_manager_test.cc (+212 lines) Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
There was a problem hiding this comment.
Code Review
This pull request restores the local raylet address for ALIVE actors upon GCS restart, preventing zombie worker processes by ensuring GCS can send KillLocalActorRequests. It also introduces unit tests to verify this behavior. However, the newly added tests fail to compile because the GcsActorManager constructor is invoked with missing arguments (observability_publisher and clock).
Add missing observability_publisher and clock parameters to GcsActorManager constructor calls in unit tests. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
The MockActorScheduler class incorrectly added GetPendingActorsCount() and CancelInFlightActorScheduling() methods that don't exist in the base GcsActorSchedulerInterface interface, causing compilation errors. These methods were likely copied from GcsActorManager but are not part of the scheduler interface contract. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
| auto node_it = nodes.find(node_id); | ||
| // Only restore local raylet address if the node is still alive. | ||
| // If the node is dead, the raylet is unreachable anyway. | ||
| if (node_it != nodes.end() && |
There was a problem hiding this comment.
probably outside the scope of this pr, but maybe we should consider handling cases where an actor is considered alive while the node it is running on is dead (retrigger the GcsActorManager::OnNodeDead path maybe)
There was a problem hiding this comment.
I agree this is a larger recovery gap worth addressing, but I’d prefer to keep that separate from this PR. This change only restores the missing local_raylet_address_ during initialization.
| ASSERT_EQ(actor->GetState(), rpc::ActorTableData::ALIVE); | ||
| } | ||
|
|
||
| TEST_F(GcsActorManagerTest, TestInitializeRestoresLocalRayletAddressForAliveActors) { |
There was a problem hiding this comment.
nit: the two tests have some duplicated code. If possible we could reuse some of the code
There was a problem hiding this comment.
The dead-node test has been removed, and the remaining setup is now shared through the test helpers, so the duplication here should be much smaller now.
|
thanks for working on this fix, overall the changes look good to me. There appear to be some build issues could you please fix them. |
|
Thank you @sampan-s-nayak and @rueian for the timely review and valuable feedback. I updated the recovery logic to restore For the |
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
sampan-s-nayak
left a comment
There was a problem hiding this comment.
Thanks for working on this!
…ay-project#63764) Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…e process issue after GCS restart
…ay-project#63764) Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Motivation
This change fixes a bug where actor processes become unkillable zombies after GCS restart. The issue occurs because the
local_raylet_address_field, which is required for sending termination requests, is not persisted as part of GCS FT and was not being restored during GCS initialization.Impact:
Related issue
Fixed #63763
Implementation Details
Files Changed:
src/ray/gcs/actor/gcs_actor_manager.ccsrc/ray/gcs/actor/tests/gcs_actor_manager_test.ccKey Changes:
1. GCS actor recovery fix (
gcs_actor_manager.cc)Restore
local_raylet_address_inGcsActorManager::Initialize()for restoredALIVEactors using the actor's assigned node metadata:This keeps
LocalRayletAddress()consistent with the actor's assigned node duringrecovery and allows later cleanup paths to send
KillLocalActorRequest.2. Regression test (
gcs_actor_manager_test.cc)Added
TestInitializeRestoresLocalRayletAddressForAliveActors, plus test-onlyhelpers for constructing
GcsInitDataand inspecting restored actors.Test Coverage:
ALIVEactor is restored duringGcsActorManager::Initialize()local_raylet_address_is reconstructed from persisted node metadataScope
This PR is limited to restoring the missing
local_raylet_address_duringinitialization.
It does not change broader recovery behavior for cases where persisted actor state
and node liveness may already be inconsistent after restart.
Verification
Automated Tests
Run the GCS actor manager test target:
bazel test //src/ray/gcs/actor/tests:gcs_actor_manager_testTest Assertions Verified