Skip to content

[Core][gcs] Fix replica actor zombie process issue after GCS restart#63764

Merged
sampan-s-nayak merged 4 commits into
ray-project:masterfrom
daiping8:replica_actor
Jun 2, 2026
Merged

[Core][gcs] Fix replica actor zombie process issue after GCS restart#63764
sampan-s-nayak merged 4 commits into
ray-project:masterfrom
daiping8:replica_actor

Conversation

@daiping8

@daiping8 daiping8 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Motivation

This change fixes a bug where actor processes become unkillable zombies after GCS restart. The issue occurs because the local_raylet_address_ field, which is required for sending termination requests, is not persisted as part of GCS FT and was not being restored during GCS initialization.

Impact:

  • Zombie actor processes consume resources indefinitely
  • Cannot clean up actors after GCS restart
  • Resource leaks in long-running clusters with GCS restarts

Related issue

Fixed #63763

Implementation Details

Files Changed:

  1. src/ray/gcs/actor/gcs_actor_manager.cc
  2. src/ray/gcs/actor/tests/gcs_actor_manager_test.cc

Key Changes:

1. GCS actor recovery fix (gcs_actor_manager.cc)

Restore local_raylet_address_ in GcsActorManager::Initialize() for restored
ALIVE actors using the actor's assigned node metadata:

} else if (actor_table_data.state() == ray::rpc::ActorTableData::ALIVE) {
  created_actors_[actor->GetNodeID()].emplace(actor->GetWorkerID(),
                                              actor->GetActorID());
  // Restore local raylet address for ALIVE actors after GCS restart.
  // This is necessary because local_raylet_address_ is not persisted with GCS FT,
  // but is required to send KillLocalActorRequest when destroying the actor.
  if (!actor->GetNodeID().IsNil()) {
    const auto &node_id = actor->GetNodeID();
    const auto &nodes = gcs_init_data.Nodes();
    auto node_it = nodes.find(node_id);
    if (node_it != nodes.end()) {
      const auto &node_info = node_it->second;
      rpc::Address actor_local_raylet_address;
      actor_local_raylet_address.set_node_id(node_info.node_id());
      actor_local_raylet_address.set_ip_address(node_info.node_manager_address());
      actor_local_raylet_address.set_port(node_info.node_manager_port());
      actor->UpdateLocalRayletAddress(actor_local_raylet_address);
    }
  }
}

This keeps LocalRayletAddress() consistent with the actor's assigned node during
recovery and allows later cleanup paths to send KillLocalActorRequest.

2. Regression test (gcs_actor_manager_test.cc)

Added TestInitializeRestoresLocalRayletAddressForAliveActors, plus test-only
helpers for constructing GcsInitData and inspecting restored actors.

Test Coverage:

  • Verifies an ALIVE actor is restored during GcsActorManager::Initialize()
  • Verifies local_raylet_address_ is reconstructed from persisted node metadata
  • Verifies the restored address matches the node manager address and port

Scope

This PR is limited to restoring the missing local_raylet_address_ during
initialization.

It does not change broader recovery behavior for cases where persisted actor state
and node liveness may already be inconsistent after restart.

Verification

Automated Tests

Run the GCS actor manager test target:

bazel test //src/ray/gcs/actor/tests:gcs_actor_manager_test

Test Assertions Verified

ASSERT_EQ(RegisteredActorCount(*test_gcs_actor_manager), 1u);
ASSERT_TRUE(local_raylet_address.has_value());
ASSERT_EQ(local_raylet_address->node_id(), node_id.Binary());
ASSERT_EQ(local_raylet_address->ip_address(), "127.0.0.1");
ASSERT_EQ(local_raylet_address->port(), 9999);
Problem:
After GCS restart, the local_raylet_address_ for ALIVE actors is not restored,
which prevents GCS from sending KillLocalActorRequest to clean up zombie actor
processes. Without this address, GCS logs 'Actor has not been assigned a lease'
and the worker process remains alive.

Solution:
In GcsActorManager::Initialize(), restore local raylet address for ALIVE actors
on ALIVE nodes. The address is retrieved from persisted node info in Redis.

Tests:
- TestInitializeRestoresLocalRayletAddressForAliveActors
- TestInitializeDoesNotRestoreLocalRayletAddressForDeadNodes

Files modified:
- src/ray/gcs/actor/gcs_actor_manager.cc (+21 lines)
- src/ray/gcs/actor/tests/gcs_actor_manager_test.cc (+212 lines)

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 requested a review from a team as a code owner June 1, 2026 08:18

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request restores the local raylet address for ALIVE actors upon GCS restart, preventing zombie worker processes by ensuring GCS can send KillLocalActorRequests. It also introduces unit tests to verify this behavior. However, the newly added tests fail to compile because the GcsActorManager constructor is invoked with missing arguments (observability_publisher and clock).

Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc Outdated
Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc Outdated
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Jun 1, 2026
daiping8 added 2 commits June 1, 2026 17:16
Add missing observability_publisher and clock parameters to
GcsActorManager constructor calls in unit tests.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
The MockActorScheduler class incorrectly added GetPendingActorsCount() and
CancelInFlightActorScheduling() methods that don't exist in the base
GcsActorSchedulerInterface interface, causing compilation errors.

These methods were likely copied from GcsActorManager but are not part of
the scheduler interface contract.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 1, 2026
@rueian rueian self-assigned this Jun 1, 2026
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
auto node_it = nodes.find(node_id);
// Only restore local raylet address if the node is still alive.
// If the node is dead, the raylet is unreachable anyway.
if (node_it != nodes.end() &&

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably outside the scope of this pr, but maybe we should consider handling cases where an actor is considered alive while the node it is running on is dead (retrigger the GcsActorManager::OnNodeDead path maybe)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a larger recovery gap worth addressing, but I’d prefer to keep that separate from this PR. This change only restores the missing local_raylet_address_ during initialization.

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
ASSERT_EQ(actor->GetState(), rpc::ActorTableData::ALIVE);
}

TEST_F(GcsActorManagerTest, TestInitializeRestoresLocalRayletAddressForAliveActors) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the two tests have some duplicated code. If possible we could reuse some of the code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dead-node test has been removed, and the remaining setup is now shared through the test helpers, so the duplication here should be much smaller now.

@sampan-s-nayak

Copy link
Copy Markdown
Contributor

thanks for working on this fix, overall the changes look good to me. There appear to be some build issues could you please fix them.

@daiping8

daiping8 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Thank you @sampan-s-nayak and @rueian for the timely review and valuable feedback.

I updated the recovery logic to restore local_raylet_address_ whenever the actor has a non-nil assigned node_id, without additionally requiring a non-nil worker_id or an ALIVE node entry. I also updated the comment to refer to GCS FT instead of Redis, and simplified the test setup by reusing helpers and keeping only the positive regression case. I also updated thePR description.

For the ALIVE actor on a DEAD node case, I agree that it likely needs a broader recovery follow-up, but I kept that out of scope for this PR.

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

@sampan-s-nayak sampan-s-nayak left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

@sampan-s-nayak sampan-s-nayak merged commit 771638b into ray-project:master Jun 2, 2026
6 checks passed
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
stekkaligrid pushed a commit to Wizak-2601/ray that referenced this pull request Jun 26, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

4 participants