[train][torchft] Ray Train manages replica group restarts by TimothySeah · Pull Request #61475 · ray-project/ray

TimothySeah · 2026-03-04T03:37:15Z

Summary

This PR follows up on #61156 by handling torchft worker group failure recovery.

Here are some of the design decisions:

ReplicaGroup.shutdown is similar to WorkerGroup.shutdown (they both shut down workers and clear state) but doesn't do some other stuff (e.g. callbacks and placementgroup cleanup).
WorkerGroup.replace_replica_group is similar to WorkerGroup._start_impl so I refactored their shared functionality accordingly. The main difference is that the former runs fewer callbacks.

This PR also changes the semantics of get_world_rank and get_local_rank/get_node_rank:

get_world_rank/get_world_size still apply to all workers (across replica groups). This is because we often need one global rank 0 e.g. DDP rank 0 worker uploads checkpoint. Since we are doing DDP only, this is equivalent to get_replica_group_id.
get_local_rank/get_local_size/get_node_rank now apply to (replica group, node) pairs. One common use case for get_local_rank is to only download data to a node once - but since every replica group gets a different shard of data, every "local rank 0" needs to download data anyway. Also, creating a local_rank across all replica groups on the same node might not be feasible; if there is a node failure that results in a replica group getting scheduled on an existing node with another replica group, we would need to re-sort all the local ranks. Fortunately, every torchft replica group is effectively its own torchrun group (communication between torchrun groups is handled by the Manager), so this is consistent with their model. For this reason we don't need to worry about CUDA_VISIBLE_DEVICES either.
The order of workers in WorkerGroupState.workers is equivalent to get_world_rank. This invariant still holds.

I also went through every single WorkerGroupCallback method and determined whether or not they are relevant for ReplicaGroups. In many cases, the behavior is the same for WorkerGroups and ReplicaGroups, so I also defined corresponding ExecutionGroupCallback methods that get called in both cases by default.

before_init_train_context: Always the same between WorkerGroup and ReplicaGroup so I moved this behavior up to ExecutionGroupCallback.

AcceleratorSetupCallback: well handled and tested in this PR
CheckpointManagerCallback and ValidationManagerCallback: same behavior. Will test this more thoroughly in the followup ray.train.report PR.
DatasetSetupCallback: will test this more in the followup data integration PR.

before_worker_group_shutdown: sometimes same (in which case the user can override before_execution_group_shutdown) sometimes different (in which case the user can override before_worker_group_shutdown or before_replica_group_shutdown).

BackendSetupCallback: implemented in before_execution_group_shutdown. Well handled and tested in this PR
StateManagerCallback: only implemented with before_worker_group_shutdown because replica groups are not reflected in train run state
ReportCallbackHandler: will be fixed in followup ray.train.report PR. The main idea is that before_replica_group_shutdown can also clear report states.

after_worker_group_start: same as before_worker_group_shutdown

BackendSetupCallback and WorkingDirectorySetupCallback: implemented in before_execution_group_shutdown. Well handled and tested in this PR
DatasetSetupCallback and ReportCallbackHandler will be handled in the aforementioned future PR's
StateManager and PlacementGroupCleanerCallback only apply to worker groups.

after_worker_group_shutdown and after_worker_group_abort are only used by DatasetsCallback. They should only apply to worker groups; when a replica group goes down, rather than shutting down any state, we simply send the state to the replacement worker. Of course, the aforementioned future data integration PR will test this better.

All other WorkerGroupCallback methods are irrelevant:

on_worker_group-start/on_worker_group_shutdown are just for timing the worker group.
before_worker_group_start and before_worker_group_abort are only used by StateManagerCallback which is irrelevant as explained earlier.
after_worker_group_training_start is never used.
after_worker_group_poll_status is irrelevant because it operates on all the workers in the worker group, while replica groups are just thin wrappers around the workers.

Testing

I'm open to more unit test suggestions. I basically tried to unit test different layers of the stack as follows:

test_torch_trainer: e2e test. I also verified that it works as expected. It's still disabled until I add torchft dependencies to the train CI.
test_controller: tests that we correctly decide when to do a replica group restart or a full worker group restart. Note that this also tests elastic training.
test_worker_group: tests that when we replace_replica_group we correctly update the relevant state (WorkerGroupState, replica groups, polling state). I added mark.parametrize to other unit tests to verify other behavior works as expected with both worker group and replica group restarts e.g. callbacks and worker initialization.

Successfully ran a driver script in a workspace with simulated node failures (killed the raylet) and confirmed that it worked.

Driver script: https://gist.github.com/TimothySeah/35aa96b81b2d98d77c23b10c0baa71c9

Logs

Failure detected: https://gist.github.com/TimothySeah/923ce69987de98b774d8de214e6845ae

Training stops due to unmet quorum

�[36m(LighthouseServerActor pid=24407)�[0m 2026-03-11T13:02:09.398 [INFO] [torchft::lighthouse] - Quorum status: New quorum not ready, only have 2 participants, need min_replicas 3 [2/2 participants healthy][2 heartbeating][shrink_only=false]
�[33m(raylet)�[0m The node with node id: 0b015b238c8d9b950c92fe08a188001514a61a624cebdfe0e2aab715 and address: 10.0.82.255 and node name: 10.0.82.255 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, etc.) 
	(2) raylet has lagging heartbeats due to slow network or busy workload

replace_replica_group fails because node is unschedulable: https://gist.github.com/TimothySeah/12030fe7814a72e1b21fd75c25710635

Autoscaling completes and training continues: https://gist.github.com/TimothySeah/98b0776199ff18a97ce64246ab7894c9

Eventually we reach the finished state.

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces support for replica group restarts, a key feature for fault tolerance with torchft. The changes are well-structured, introducing ExecutionGroup and ExecutionGroupCallback as base classes to share logic between WorkerGroup and the new ReplicaGroup concepts. The controller logic is updated to handle partial restarts of failing replica groups, and the WorkerGroup is enhanced with a replace_replica_group method. The refactoring is clean and the new functionality is supported by a comprehensive set of tests. I have one suggestion regarding the callback handling to make it more robust for custom callbacks.

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah

I think we should hold off on torchft-native elastic training in this PR. Consider the core and data dependencies below:

Ray Data Ray Core

Fixed training Easy: move data shard from worker x to its replacement None

Elastic training Hard: must reconfigure data shards Easy option: 1 PlacementGroup per replica group (size 1 for DDP). This is an antipattern. Hard option: Resizable PlacementGroup

Because it’s easy to get Ray Data + Ray Train + torchft to work for fixed training - and elastic training still works because we can just fall back to worker group restarts - I would suggest addressing this in a future PR.

Instead of having 1 WorkerGroup with many ReplicaGroups, should we just have multiple WorkerGroups? I prefer 1 WorkerGroup with many ReplicaGroups because:

Single controller single layer: We want to simultaneously poll all the workers. Similarly, we want to have a single SyncActor across all the workers to enable a report barrier. Having multiple WorkerGroups could complicate this.
WorkerGroup and ReplicaGroup have different callbacks. For example, WorkerGroup.on_start should set up one process group per replica group.
1 WorkerGroup also translates more naturally to the Ray Train dashboard - showing 1 WorkerGroup per replica group in the UI would add unnecessary detail and complexity.
It makes sense to have ReplicaGroup as a first class concept because it is data parallel group and it is intuitive to shard 1 dataset across the data parallel groups of 1 workergroup.

TimothySeah · 2026-03-07T01:58:45Z

Question: how will state transitions look?
Answer: TrainRun transitions will be the same (RUNNING → RESTARTING → SCHEDULING → RUNNING) but there will be no TrainRunAttempt.

…plica groups but everything else is not Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah · 2026-03-11T22:17:15Z

Sanity check audit of places that use get_local_rank

Set env vars: torch, torch xla. Not an issue: every replica group is independent
Raytrainreportcallback lightning: only local rank 0 clears checkpoint directory. Not an issue: torchft team has not documented torchft + lightning, in the worst case 2 workers deleting the same directory isn’t a big deal. Separate issue: 2 workers write to same checkpoint right now
Prepare_model. Not an issue: just toggles printing. After this change, we do get_devices()[0] where get_devices() will only return 1 device (e.g. 2), as opposed to get_devices()[2].
Resent.ipynb and pytorch_resent_finetune.ipynb: download dataset once per node. Not an issue: every replica group has different data, and we should use ray data splitting instead
Persistent-storage.rst: every local rank 0 worker saves artifacts. Not an issue: every replica group should produce different artifacts

justinvyu

have a few more things to look at and the tests, but here's a few comments for now

…se-1b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

…e-1b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

…se-1b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu

A few high level thoughts on the design. These are non-blocking but we can discuss offline

…se-1b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu

🚢

justinvyu · 2026-04-07T00:38:32Z

+            # After sorting: node0/gpu0, node0/gpu1, node1/gpu0, node1/gpu1
+            # Each worker is its own replica group, so local_rank=0,
+            # local_world_size=1, node_rank=0 for all.
+            [
+                DistributedContext(


thanks for the test, makes the node assignment business clearer with an example

…se-1b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

…se-1b

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 126a549. Configure here.}

…t#61475) # Summary This PR follows up on ray-project#61156 by handling torchft worker group failure recovery. Here are some of the design decisions: * `ReplicaGroup.shutdown` is similar to `WorkerGroup.shutdown` (they both shut down workers and clear state) but doesn't do some other stuff (e.g. callbacks and placementgroup cleanup). * `WorkerGroup.replace_replica_group` is similar to `WorkerGroup._start_impl` so I refactored their shared functionality accordingly. The main difference is that the former runs fewer callbacks. This PR also changes the semantics of get_world_rank and get_local_rank/get_node_rank: * get_world_rank/get_world_size still apply to all workers (across replica groups). This is because we often need one global rank 0 e.g. DDP rank 0 worker uploads checkpoint. Since we are doing DDP only, this is equivalent to get_replica_group_id. * get_local_rank/get_local_size/get_node_rank now apply to (replica group, node) pairs. One common use case for get_local_rank is to only download data to a node once - but since every replica group gets a different shard of data, every "local rank 0" needs to download data anyway. Also, creating a local_rank across all replica groups on the same node might not be feasible; if there is a node failure that results in a replica group getting scheduled on an existing node with another replica group, we would need to re-sort all the local ranks. Fortunately, every torchft replica group is effectively its own torchrun group (communication between torchrun groups is handled by the `Manager`), so this is consistent with their model. For this reason we don't need to worry about `CUDA_VISIBLE_DEVICES` either. * The order of workers in `WorkerGroupState.workers` is equivalent to `get_world_rank`. This invariant still holds. I also went through every single `WorkerGroupCallback` method and determined whether or not they are relevant for `ReplicaGroups`. In many cases, the behavior is the same for `WorkerGroups` and `ReplicaGroups`, so I also defined corresponding `ExecutionGroupCallback` methods that get called in both cases by default. `before_init_train_context`: Always the same between `WorkerGroup` and `ReplicaGroup` so I moved this behavior up to `ExecutionGroupCallback`. * `AcceleratorSetupCallback`: well handled and tested in this PR * `CheckpointManagerCallback` and `ValidationManagerCallback`: same behavior. Will test this more thoroughly in the followup `ray.train.report` PR. * `DatasetSetupCallback`: will test this more in the followup `data integration` PR. `before_worker_group_shutdown`: sometimes same (in which case the user can override `before_execution_group_shutdown`) sometimes different (in which case the user can override `before_worker_group_shutdown` or `before_replica_group_shutdown`). * `BackendSetupCallback`: implemented in `before_execution_group_shutdown`. Well handled and tested in this PR * `StateManagerCallback`: only implemented with `before_worker_group_shutdown` because replica groups are not reflected in train run state * `ReportCallbackHandler`: will be fixed in followup `ray.train.report` PR. The main idea is that `before_replica_group_shutdown` can also clear report states. `after_worker_group_start`: same as `before_worker_group_shutdown` * `BackendSetupCallback` and `WorkingDirectorySetupCallback`: implemented in `before_execution_group_shutdown`. Well handled and tested in this PR * `DatasetSetupCallback` and `ReportCallbackHandler` will be handled in the aforementioned future PR's * `StateManager` and `PlacementGroupCleanerCallback` only apply to worker groups. `after_worker_group_shutdown` and `after_worker_group_abort` are only used by `DatasetsCallback`. They should only apply to worker groups; when a replica group goes down, rather than shutting down any state, we simply send the state to the replacement worker. Of course, the aforementioned future data integration PR will test this better. All other `WorkerGroupCallback` methods are irrelevant: * `on_worker_group-start/on_worker_group_shutdown` are just for timing the worker group. * `before_worker_group_start` and `before_worker_group_abort` are only used by `StateManagerCallback` which is irrelevant as explained earlier. * `after_worker_group_training_start` is never used. * `after_worker_group_poll_status` is irrelevant because it operates on all the workers in the worker group, while replica groups are just thin wrappers around the workers. # Testing I'm open to more unit test suggestions. I basically tried to unit test different layers of the stack as follows: * `test_torch_trainer`: e2e test. I also verified that it works as expected. It's still disabled until I add torchft dependencies to the train CI. * `test_controller`: tests that we correctly decide when to do a replica group restart or a full worker group restart. Note that this also tests elastic training. * `test_worker_group`: tests that when we `replace_replica_group` we correctly update the relevant state (`WorkerGroupState`, replica groups, polling state). I added `mark.parametrize` to other unit tests to verify other behavior works as expected with both worker group and replica group restarts e.g. callbacks and worker initialization. Successfully ran a driver script in a workspace with simulated node failures (killed the raylet) and confirmed that it worked. Driver script: https://gist.github.com/TimothySeah/35aa96b81b2d98d77c23b10c0baa71c9 Logs Failure detected: https://gist.github.com/TimothySeah/923ce69987de98b774d8de214e6845ae Training stops due to unmet quorum ``` �[36m(LighthouseServerActor pid=24407)�[0m 2026-03-11T13:02:09.398 [INFO] [torchft::lighthouse] - Quorum status: New quorum not ready, only have 2 participants, need min_replicas 3 [2/2 participants healthy][2 heartbeating][shrink_only=false] �[33m(raylet)�[0m The node with node id: 0b015b238c8d9b950c92fe08a188001514a61a624cebdfe0e2aab715 and address: 10.0.82.255 and node name: 10.0.82.255 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload ``` replace_replica_group fails because node is unschedulable: https://gist.github.com/TimothySeah/12030fe7814a72e1b21fd75c25710635 Autoscaling completes and training continues: https://gist.github.com/TimothySeah/98b0776199ff18a97ce64246ab7894c9 Eventually we reach the finished state. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah added 3 commits March 2, 2026 18:32

Main changes. Not unit tests or elastic changes yet.

808022d

Signed-off-by: Timothy Seah <tseah@anyscale.com>

work with elastic training

16ace25

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Unit tests + fix/refactor + fix all replica group failing corner case

9092fcd

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist Bot reviewed Mar 4, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

call before_init_train_context on all executiongroupcallbacks

376a5aa

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah marked this pull request as ready for review March 4, 2026 04:11

TimothySeah requested a review from a team as a code owner March 4, 2026 04:11

cursor Bot reviewed Mar 4, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

ray-gardener Bot added the train Ray Train Related Issue label Mar 4, 2026

handle replace_replica_group failures

470053e

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah commented Mar 7, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated

start_impl sorts workers per replica group, world is global across re…

2e6b0f0

…plica groups but everything else is not Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu reviewed Mar 17, 2026

View reviewed changes

TimothySeah added 6 commits March 17, 2026 18:15

Merge remote-tracking branch 'upstream/master' into tseah/torchft-pha…

ea4d133

…se-1b

rename overloaded bundle

4f9a2ea

Signed-off-by: Timothy Seah <tseah@anyscale.com>

move replica group callbacks to replica group

f7118a9

Signed-off-by: Timothy Seah <tseah@anyscale.com>

simple fixes

0235f58

Signed-off-by: Timothy Seah <tseah@anyscale.com>

remove unnecessary WorldRankToOngoingPoll

bfa6cd8

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Switch order to fix shutdown bug

ed5d43a

Signed-off-by: Timothy Seah <tseah@anyscale.com>

cursor Bot reviewed Mar 18, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah added 2 commits March 18, 2026 18:51

add requires_shutdown_before_non_running_decision logic

8e1eece

Signed-off-by: Timothy Seah <tseah@anyscale.com>

erge remote-tracking branch 'upstream/master' into tseah/torchft-phas…

19b4489

…e-1b

cursor Bot reviewed Mar 19, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated

TimothySeah added 2 commits March 20, 2026 11:12

make create_workers for all replica groups atomic

05a8782

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tseah/torchft-pha…

02ec1a3

…se-1b

cursor Bot reviewed Mar 20, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated

fix node rank

6d38bb4

Signed-off-by: Timothy Seah <tseah@anyscale.com>

ndnchapathi69 reviewed Mar 23, 2026

View reviewed changes

Comment thread python/ray/train/v2/tests/test_worker_group.py

justinvyu reviewed Mar 26, 2026

View reviewed changes

TimothySeah added 3 commits April 1, 2026 17:48

Merge remote-tracking branch 'upstream/master' into tseah/torchft-pha…

12bf333

…se-1b

remove requires_shutdown_before_non_running_decision

d4b30a4

Signed-off-by: Timothy Seah <tseah@anyscale.com>

torchft unit test verifies replica group restarts

14f1c2b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

cursor Bot reviewed Apr 2, 2026

View reviewed changes

Comment thread python/ray/train/v2/torch/torchft_config.py

TimothySeah added 2 commits April 2, 2026 13:31

add 1 more test case

217cccd

Signed-off-by: Timothy Seah <tseah@anyscale.com>

address more comments

f1381e4

Signed-off-by: Timothy Seah <tseah@anyscale.com>

cursor Bot reviewed Apr 3, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/callback.py Outdated

TimothySeah added 2 commits April 2, 2026 19:11

Add _assign_worker_ranks test

53b73c5

Signed-off-by: Timothy Seah <tseah@anyscale.com>

address cursor-reported bugs

bfe7808

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from justinvyu April 3, 2026 02:24

justinvyu approved these changes Apr 7, 2026

View reviewed changes

justinvyu enabled auto-merge (squash) April 8, 2026 23:55

github-actions Bot added the go add ONLY when ready to merge, run all tests label Apr 8, 2026

Merge remote-tracking branch 'upstream/master' into tseah/torchft-pha…

54161cb

…se-1b

github-actions Bot disabled auto-merge April 9, 2026 00:53

cursor Bot reviewed Apr 9, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

address comment + verify still works

faa1c34

Signed-off-by: Timothy Seah <tseah@anyscale.com>

cursor Bot reviewed Apr 9, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah added 2 commits April 10, 2026 18:29

add todo

457a1c9

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tseah/torchft-pha…

126a549

…se-1b

cursor Bot reviewed Apr 11, 2026

View reviewed changes

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

matthewdeng merged commit 06580e2 into ray-project:master Apr 14, 2026
6 checks passed

	Ray Data	Ray Core
Fixed training	Easy: move data shard from worker x to its replacement	None
Elastic training	Hard: must reconfigure data shards	Easy option: 1 PlacementGroup per replica group (size 1 for DDP). This is an antipattern. Hard option: Resizable PlacementGroup

Uh oh!

Conversation

TimothySeah commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

TimothySeah left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TimothySeah commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TimothySeah commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

justinvyu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

justinvyu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

justinvyu left a comment

Choose a reason for hiding this comment

Uh oh!

justinvyu Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Labels

4 participants

TimothySeah commented Mar 4, 2026 •

edited

Loading

TimothySeah left a comment •

edited

Loading

TimothySeah commented Mar 7, 2026 •

edited

Loading

TimothySeah commented Mar 11, 2026 •

edited

Loading