Skip to content

[train][torchft] Ray Train manages replica group restarts#61475

Merged
matthewdeng merged 28 commits into
ray-project:masterfrom
TimothySeah:tseah/torchft-phase-1b
Apr 14, 2026
Merged

[train][torchft] Ray Train manages replica group restarts#61475
matthewdeng merged 28 commits into
ray-project:masterfrom
TimothySeah:tseah/torchft-phase-1b

Conversation

@TimothySeah

@TimothySeah TimothySeah commented Mar 4, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR follows up on #61156 by handling torchft worker group failure recovery.

Here are some of the design decisions:

  • ReplicaGroup.shutdown is similar to WorkerGroup.shutdown (they both shut down workers and clear state) but doesn't do some other stuff (e.g. callbacks and placementgroup cleanup).
  • WorkerGroup.replace_replica_group is similar to WorkerGroup._start_impl so I refactored their shared functionality accordingly. The main difference is that the former runs fewer callbacks.

This PR also changes the semantics of get_world_rank and get_local_rank/get_node_rank:

  • get_world_rank/get_world_size still apply to all workers (across replica groups). This is because we often need one global rank 0 e.g. DDP rank 0 worker uploads checkpoint. Since we are doing DDP only, this is equivalent to get_replica_group_id.
  • get_local_rank/get_local_size/get_node_rank now apply to (replica group, node) pairs. One common use case for get_local_rank is to only download data to a node once - but since every replica group gets a different shard of data, every "local rank 0" needs to download data anyway. Also, creating a local_rank across all replica groups on the same node might not be feasible; if there is a node failure that results in a replica group getting scheduled on an existing node with another replica group, we would need to re-sort all the local ranks. Fortunately, every torchft replica group is effectively its own torchrun group (communication between torchrun groups is handled by the Manager), so this is consistent with their model. For this reason we don't need to worry about CUDA_VISIBLE_DEVICES either.
  • The order of workers in WorkerGroupState.workers is equivalent to get_world_rank. This invariant still holds.

I also went through every single WorkerGroupCallback method and determined whether or not they are relevant for ReplicaGroups. In many cases, the behavior is the same for WorkerGroups and ReplicaGroups, so I also defined corresponding ExecutionGroupCallback methods that get called in both cases by default.

before_init_train_context: Always the same between WorkerGroup and ReplicaGroup so I moved this behavior up to ExecutionGroupCallback.

  • AcceleratorSetupCallback: well handled and tested in this PR
  • CheckpointManagerCallback and ValidationManagerCallback: same behavior. Will test this more thoroughly in the followup ray.train.report PR.
  • DatasetSetupCallback: will test this more in the followup data integration PR.

before_worker_group_shutdown: sometimes same (in which case the user can override before_execution_group_shutdown) sometimes different (in which case the user can override before_worker_group_shutdown or before_replica_group_shutdown).

  • BackendSetupCallback: implemented in before_execution_group_shutdown. Well handled and tested in this PR
  • StateManagerCallback: only implemented with before_worker_group_shutdown because replica groups are not reflected in train run state
  • ReportCallbackHandler: will be fixed in followup ray.train.report PR. The main idea is that before_replica_group_shutdown can also clear report states.

after_worker_group_start: same as before_worker_group_shutdown

  • BackendSetupCallback and WorkingDirectorySetupCallback: implemented in before_execution_group_shutdown. Well handled and tested in this PR
  • DatasetSetupCallback and ReportCallbackHandler will be handled in the aforementioned future PR's
  • StateManager and PlacementGroupCleanerCallback only apply to worker groups.

after_worker_group_shutdown and after_worker_group_abort are only used by DatasetsCallback. They should only apply to worker groups; when a replica group goes down, rather than shutting down any state, we simply send the state to the replacement worker. Of course, the aforementioned future data integration PR will test this better.

All other WorkerGroupCallback methods are irrelevant:

  • on_worker_group-start/on_worker_group_shutdown are just for timing the worker group.
  • before_worker_group_start and before_worker_group_abort are only used by StateManagerCallback which is irrelevant as explained earlier.
  • after_worker_group_training_start is never used.
  • after_worker_group_poll_status is irrelevant because it operates on all the workers in the worker group, while replica groups are just thin wrappers around the workers.

Testing

I'm open to more unit test suggestions. I basically tried to unit test different layers of the stack as follows:

  • test_torch_trainer: e2e test. I also verified that it works as expected. It's still disabled until I add torchft dependencies to the train CI.
  • test_controller: tests that we correctly decide when to do a replica group restart or a full worker group restart. Note that this also tests elastic training.
  • test_worker_group: tests that when we replace_replica_group we correctly update the relevant state (WorkerGroupState, replica groups, polling state). I added mark.parametrize to other unit tests to verify other behavior works as expected with both worker group and replica group restarts e.g. callbacks and worker initialization.

Successfully ran a driver script in a workspace with simulated node failures (killed the raylet) and confirmed that it worked.

Driver script: https://gist.github.com/TimothySeah/35aa96b81b2d98d77c23b10c0baa71c9

Logs

Failure detected: https://gist.github.com/TimothySeah/923ce69987de98b774d8de214e6845ae

Training stops due to unmet quorum

�[36m(LighthouseServerActor pid=24407)�[0m 2026-03-11T13:02:09.398 [INFO] [torchft::lighthouse] - Quorum status: New quorum not ready, only have 2 participants, need min_replicas 3 [2/2 participants healthy][2 heartbeating][shrink_only=false]
�[33m(raylet)�[0m The node with node id: 0b015b238c8d9b950c92fe08a188001514a61a624cebdfe0e2aab715 and address: 10.0.82.255 and node name: 10.0.82.255 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, etc.) 
	(2) raylet has lagging heartbeats due to slow network or busy workload

replace_replica_group fails because node is unschedulable: https://gist.github.com/TimothySeah/12030fe7814a72e1b21fd75c25710635

Autoscaling completes and training continues: https://gist.github.com/TimothySeah/98b0776199ff18a97ce64246ab7894c9

Eventually we reach the finished state.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for replica group restarts, a key feature for fault tolerance with torchft. The changes are well-structured, introducing ExecutionGroup and ExecutionGroupCallback as base classes to share logic between WorkerGroup and the new ReplicaGroup concepts. The controller logic is updated to handle partial restarts of failing replica groups, and the WorkerGroup is enhanced with a replace_replica_group method. The refactoring is clean and the new functionality is supported by a comprehensive set of tests. I have one suggestion regarding the callback handling to make it more robust for custom callbacks.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah marked this pull request as ready for review March 4, 2026 04:11
@TimothySeah TimothySeah requested a review from a team as a code owner March 4, 2026 04:11
@ray-gardener ray-gardener Bot added the train Ray Train Related Issue label Mar 4, 2026
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@TimothySeah TimothySeah left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should hold off on torchft-native elastic training in this PR. Consider the core and data dependencies below:

  Ray Data Ray Core
Fixed training Easy: move data shard from worker x to its replacement None
Elastic training Hard: must reconfigure data shards Easy option: 1 PlacementGroup per replica group (size 1 for DDP). This is an antipattern. Hard option: Resizable PlacementGroup

Because it’s easy to get Ray Data + Ray Train + torchft to work for fixed training - and elastic training still works because we can just fall back to worker group restarts - I would suggest addressing this in a future PR.

Instead of having 1 WorkerGroup with many ReplicaGroups, should we just have multiple WorkerGroups? I prefer 1 WorkerGroup with many ReplicaGroups because:

  • Single controller single layer: We want to simultaneously poll all the workers. Similarly, we want to have a single SyncActor across all the workers to enable a report barrier. Having multiple WorkerGroups could complicate this.
  • WorkerGroup and ReplicaGroup have different callbacks. For example, WorkerGroup.on_start should set up one process group per replica group.
  • 1 WorkerGroup also translates more naturally to the Ray Train dashboard - showing 1 WorkerGroup per replica group in the UI would add unnecessary detail and complexity.
  • It makes sense to have ReplicaGroup as a first class concept because it is data parallel group and it is intuitive to shard 1 dataset across the data parallel groups of 1 workergroup.
Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
@TimothySeah

TimothySeah commented Mar 7, 2026

Copy link
Copy Markdown
Contributor Author

Question: how will state transitions look?
Answer: TrainRun transitions will be the same (RUNNING → RESTARTING → SCHEDULING → RUNNING) but there will be no TrainRunAttempt.

…plica groups but everything else is not

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah

TimothySeah commented Mar 11, 2026

Copy link
Copy Markdown
Contributor Author

Sanity check audit of places that use get_local_rank

  • Set env vars: torch, torch xla. Not an issue: every replica group is independent
  • Raytrainreportcallback lightning: only local rank 0 clears checkpoint directory. Not an issue: torchft team has not documented torchft + lightning, in the worst case 2 workers deleting the same directory isn’t a big deal. Separate issue: 2 workers write to same checkpoint right now
  • Prepare_model. Not an issue: just toggles printing. After this change, we do get_devices()[0] where get_devices() will only return 1 device (e.g. 2), as opposed to get_devices()[2].
  • Resent.ipynb and pytorch_resent_finetune.ipynb: download dataset once per node. Not an issue: every replica group has different data, and we should use ray data splitting instead
  • Persistent-storage.rst: every local rank 0 worker saves artifacts. Not an issue: every replica group should produce different artifacts

@justinvyu justinvyu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a few more things to look at and the tests, but here's a few comments for now

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
Comment thread python/ray/train/v2/torch/torchft_config.py Outdated
Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
Comment thread python/ray/train/v2/_internal/execution/worker_group/poll.py Outdated
Comment thread python/ray/train/v2/_internal/execution/worker_group/execution_group.py Outdated
Comment thread python/ray/train/v2/torch/torchft_config.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py
Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment thread python/ray/train/v2/tests/test_worker_group.py

@justinvyu justinvyu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few high level thoughts on the design. These are non-blocking but we can discuss offline

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker.py Outdated
Comment thread python/ray/train/v2/_internal/execution/scaling_policy/elastic.py Outdated
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py
Comment thread python/ray/train/v2/_internal/execution/callback.py
Comment thread python/ray/train/v2/torch/torchft_config.py
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment thread python/ray/train/v2/_internal/execution/callback.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from justinvyu April 3, 2026 02:24

@justinvyu justinvyu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Comment thread python/ray/train/v2/_internal/execution/controller/controller.py
Comment on lines +1012 to +1016
# After sorting: node0/gpu0, node0/gpu1, node1/gpu0, node1/gpu1
# Each worker is its own replica group, so local_rank=0,
# local_world_size=1, node_rank=0 for all.
[
DistributedContext(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the test, makes the node assignment business clearer with an example

@justinvyu justinvyu enabled auto-merge (squash) April 8, 2026 23:55
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Apr 8, 2026
@github-actions github-actions Bot disabled auto-merge April 9, 2026 00:53
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment thread python/ray/train/v2/_internal/execution/controller/controller.py
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 126a549. Configure here.

@matthewdeng matthewdeng merged commit 06580e2 into ray-project:master Apr 14, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…t#61475)

# Summary

This PR follows up on ray-project#61156 by
handling torchft worker group failure recovery.

Here are some of the design decisions:
* `ReplicaGroup.shutdown` is similar to `WorkerGroup.shutdown` (they
both shut down workers and clear state) but doesn't do some other stuff
(e.g. callbacks and placementgroup cleanup).
* `WorkerGroup.replace_replica_group` is similar to
`WorkerGroup._start_impl` so I refactored their shared functionality
accordingly. The main difference is that the former runs fewer
callbacks.

This PR also changes the semantics of get_world_rank and
get_local_rank/get_node_rank:
* get_world_rank/get_world_size still apply to all workers (across
replica groups). This is because we often need one global rank 0 e.g.
DDP rank 0 worker uploads checkpoint. Since we are doing DDP only, this
is equivalent to get_replica_group_id.
* get_local_rank/get_local_size/get_node_rank now apply to (replica
group, node) pairs. One common use case for get_local_rank is to only
download data to a node once - but since every replica group gets a
different shard of data, every "local rank 0" needs to download data
anyway. Also, creating a local_rank across all replica groups on the
same node might not be feasible; if there is a node failure that results
in a replica group getting scheduled on an existing node with another
replica group, we would need to re-sort all the local ranks.
Fortunately, every torchft replica group is effectively its own torchrun
group (communication between torchrun groups is handled by the
`Manager`), so this is consistent with their model. For this reason we
don't need to worry about `CUDA_VISIBLE_DEVICES` either.
* The order of workers in `WorkerGroupState.workers` is equivalent to
`get_world_rank`. This invariant still holds.

I also went through every single `WorkerGroupCallback` method and
determined whether or not they are relevant for `ReplicaGroups`. In many
cases, the behavior is the same for `WorkerGroups` and `ReplicaGroups`,
so I also defined corresponding `ExecutionGroupCallback` methods that
get called in both cases by default.

`before_init_train_context`: Always the same between `WorkerGroup` and
`ReplicaGroup` so I moved this behavior up to `ExecutionGroupCallback`.
* `AcceleratorSetupCallback`: well handled and tested in this PR
* `CheckpointManagerCallback` and `ValidationManagerCallback`: same
behavior. Will test this more thoroughly in the followup
`ray.train.report` PR.
* `DatasetSetupCallback`: will test this more in the followup `data
integration` PR.

`before_worker_group_shutdown`: sometimes same (in which case the user
can override `before_execution_group_shutdown`) sometimes different (in
which case the user can override `before_worker_group_shutdown` or
`before_replica_group_shutdown`).
* `BackendSetupCallback`: implemented in
`before_execution_group_shutdown`. Well handled and tested in this PR
* `StateManagerCallback`: only implemented with
`before_worker_group_shutdown` because replica groups are not reflected
in train run state
* `ReportCallbackHandler`: will be fixed in followup `ray.train.report`
PR. The main idea is that `before_replica_group_shutdown` can also clear
report states.

`after_worker_group_start`: same as `before_worker_group_shutdown`
* `BackendSetupCallback` and `WorkingDirectorySetupCallback`:
implemented in `before_execution_group_shutdown`. Well handled and
tested in this PR
* `DatasetSetupCallback` and `ReportCallbackHandler` will be handled in
the aforementioned future PR's
* `StateManager` and `PlacementGroupCleanerCallback` only apply to
worker groups.

`after_worker_group_shutdown` and `after_worker_group_abort` are only
used by `DatasetsCallback`. They should only apply to worker groups;
when a replica group goes down, rather than shutting down any state, we
simply send the state to the replacement worker. Of course, the
aforementioned future data integration PR will test this better.

All other `WorkerGroupCallback` methods are irrelevant:
* `on_worker_group-start/on_worker_group_shutdown` are just for timing
the worker group.
* `before_worker_group_start` and `before_worker_group_abort` are only
used by `StateManagerCallback` which is irrelevant as explained earlier.
* `after_worker_group_training_start` is never used.
* `after_worker_group_poll_status` is irrelevant because it operates on
all the workers in the worker group, while replica groups are just thin
wrappers around the workers.

# Testing

I'm open to more unit test suggestions. I basically tried to unit test
different layers of the stack as follows:
* `test_torch_trainer`: e2e test. I also verified that it works as
expected. It's still disabled until I add torchft dependencies to the
train CI.
* `test_controller`: tests that we correctly decide when to do a replica
group restart or a full worker group restart. Note that this also tests
elastic training.
* `test_worker_group`: tests that when we `replace_replica_group` we
correctly update the relevant state (`WorkerGroupState`, replica groups,
polling state). I added `mark.parametrize` to other unit tests to verify
other behavior works as expected with both worker group and replica
group restarts e.g. callbacks and worker initialization.

Successfully ran a driver script in a workspace with simulated node
failures (killed the raylet) and confirmed that it worked.

Driver script:
https://gist.github.com/TimothySeah/35aa96b81b2d98d77c23b10c0baa71c9

Logs 

Failure detected:
https://gist.github.com/TimothySeah/923ce69987de98b774d8de214e6845ae

Training stops due to unmet quorum

```
�[36m(LighthouseServerActor pid=24407)�[0m 2026-03-11T13:02:09.398 [INFO] [torchft::lighthouse] - Quorum status: New quorum not ready, only have 2 participants, need min_replicas 3 [2/2 participants healthy][2 heartbeating][shrink_only=false]
�[33m(raylet)�[0m The node with node id: 0b015b238c8d9b950c92fe08a188001514a61a624cebdfe0e2aab715 and address: 10.0.82.255 and node name: 10.0.82.255 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, etc.) 
	(2) raylet has lagging heartbeats due to slow network or busy workload
```

replace_replica_group fails because node is unschedulable:
https://gist.github.com/TimothySeah/12030fe7814a72e1b21fd75c25710635


Autoscaling completes and training continues:
https://gist.github.com/TimothySeah/98b0776199ff18a97ce64246ab7894c9

Eventually we reach the finished state.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…t#61475)

# Summary

This PR follows up on ray-project#61156 by
handling torchft worker group failure recovery.

Here are some of the design decisions:
* `ReplicaGroup.shutdown` is similar to `WorkerGroup.shutdown` (they
both shut down workers and clear state) but doesn't do some other stuff
(e.g. callbacks and placementgroup cleanup).
* `WorkerGroup.replace_replica_group` is similar to
`WorkerGroup._start_impl` so I refactored their shared functionality
accordingly. The main difference is that the former runs fewer
callbacks.

This PR also changes the semantics of get_world_rank and
get_local_rank/get_node_rank:
* get_world_rank/get_world_size still apply to all workers (across
replica groups). This is because we often need one global rank 0 e.g.
DDP rank 0 worker uploads checkpoint. Since we are doing DDP only, this
is equivalent to get_replica_group_id.
* get_local_rank/get_local_size/get_node_rank now apply to (replica
group, node) pairs. One common use case for get_local_rank is to only
download data to a node once - but since every replica group gets a
different shard of data, every "local rank 0" needs to download data
anyway. Also, creating a local_rank across all replica groups on the
same node might not be feasible; if there is a node failure that results
in a replica group getting scheduled on an existing node with another
replica group, we would need to re-sort all the local ranks.
Fortunately, every torchft replica group is effectively its own torchrun
group (communication between torchrun groups is handled by the
`Manager`), so this is consistent with their model. For this reason we
don't need to worry about `CUDA_VISIBLE_DEVICES` either.
* The order of workers in `WorkerGroupState.workers` is equivalent to
`get_world_rank`. This invariant still holds.

I also went through every single `WorkerGroupCallback` method and
determined whether or not they are relevant for `ReplicaGroups`. In many
cases, the behavior is the same for `WorkerGroups` and `ReplicaGroups`,
so I also defined corresponding `ExecutionGroupCallback` methods that
get called in both cases by default.

`before_init_train_context`: Always the same between `WorkerGroup` and
`ReplicaGroup` so I moved this behavior up to `ExecutionGroupCallback`.
* `AcceleratorSetupCallback`: well handled and tested in this PR
* `CheckpointManagerCallback` and `ValidationManagerCallback`: same
behavior. Will test this more thoroughly in the followup
`ray.train.report` PR.
* `DatasetSetupCallback`: will test this more in the followup `data
integration` PR.

`before_worker_group_shutdown`: sometimes same (in which case the user
can override `before_execution_group_shutdown`) sometimes different (in
which case the user can override `before_worker_group_shutdown` or
`before_replica_group_shutdown`).
* `BackendSetupCallback`: implemented in
`before_execution_group_shutdown`. Well handled and tested in this PR
* `StateManagerCallback`: only implemented with
`before_worker_group_shutdown` because replica groups are not reflected
in train run state
* `ReportCallbackHandler`: will be fixed in followup `ray.train.report`
PR. The main idea is that `before_replica_group_shutdown` can also clear
report states.

`after_worker_group_start`: same as `before_worker_group_shutdown`
* `BackendSetupCallback` and `WorkingDirectorySetupCallback`:
implemented in `before_execution_group_shutdown`. Well handled and
tested in this PR
* `DatasetSetupCallback` and `ReportCallbackHandler` will be handled in
the aforementioned future PR's
* `StateManager` and `PlacementGroupCleanerCallback` only apply to
worker groups.

`after_worker_group_shutdown` and `after_worker_group_abort` are only
used by `DatasetsCallback`. They should only apply to worker groups;
when a replica group goes down, rather than shutting down any state, we
simply send the state to the replacement worker. Of course, the
aforementioned future data integration PR will test this better.

All other `WorkerGroupCallback` methods are irrelevant:
* `on_worker_group-start/on_worker_group_shutdown` are just for timing
the worker group.
* `before_worker_group_start` and `before_worker_group_abort` are only
used by `StateManagerCallback` which is irrelevant as explained earlier.
* `after_worker_group_training_start` is never used.
* `after_worker_group_poll_status` is irrelevant because it operates on
all the workers in the worker group, while replica groups are just thin
wrappers around the workers.

# Testing

I'm open to more unit test suggestions. I basically tried to unit test
different layers of the stack as follows:
* `test_torch_trainer`: e2e test. I also verified that it works as
expected. It's still disabled until I add torchft dependencies to the
train CI.
* `test_controller`: tests that we correctly decide when to do a replica
group restart or a full worker group restart. Note that this also tests
elastic training.
* `test_worker_group`: tests that when we `replace_replica_group` we
correctly update the relevant state (`WorkerGroupState`, replica groups,
polling state). I added `mark.parametrize` to other unit tests to verify
other behavior works as expected with both worker group and replica
group restarts e.g. callbacks and worker initialization.

Successfully ran a driver script in a workspace with simulated node
failures (killed the raylet) and confirmed that it worked.

Driver script:
https://gist.github.com/TimothySeah/35aa96b81b2d98d77c23b10c0baa71c9

Logs 

Failure detected:
https://gist.github.com/TimothySeah/923ce69987de98b774d8de214e6845ae

Training stops due to unmet quorum

```
�[36m(LighthouseServerActor pid=24407)�[0m 2026-03-11T13:02:09.398 [INFO] [torchft::lighthouse] - Quorum status: New quorum not ready, only have 2 participants, need min_replicas 3 [2/2 participants healthy][2 heartbeating][shrink_only=false]
�[33m(raylet)�[0m The node with node id: 0b015b238c8d9b950c92fe08a188001514a61a624cebdfe0e2aab715 and address: 10.0.82.255 and node name: 10.0.82.255 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, etc.) 
	(2) raylet has lagging heartbeats due to slow network or busy workload
```

replace_replica_group fails because node is unschedulable:
https://gist.github.com/TimothySeah/12030fe7814a72e1b21fd75c25710635


Autoscaling completes and training continues:
https://gist.github.com/TimothySeah/98b0776199ff18a97ce64246ab7894c9

Eventually we reach the finished state.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

4 participants