[train] Improve JaxTrainer TPU multi-slice fault tolerance and reservation ergonomics by liulehui · Pull Request #62893 · ray-project/ray

liulehui · 2026-04-23T22:45:19Z

Description

Fix JAX hangs after preemption recovery due to stale MEGASCALE_* env vars.

When one slice in an N-slice topology is preempted, the underlying TPU provider may inject stale MEGASCALE_NUM_SLICES / MEGASCALE_SLICE_ID / MEGASCALE_COORDINATOR_ADDRESS env vars on the replacement pods (e.g. reporting MEGASCALE_NUM_SLICES=3 for a 2-slice job because terminating pods were still counted). jax.distributed.initialize() then hangs forever waiting for the third slice that never appears.

Ray Train already has the authoritative view of the live worker group, so this PR makes _JaxBackend._setup_jax_distributed_environment always override the four MEGASCALE_* keys with values computed from the current worker group, regardless of what the provider stamped onto the pod. This decouples Ray Train from the provider's view and turns a previously fatal preemption recovery into a clean restart.
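The override described above can be sketched as a small pure function. This is an illustration only, not the code in `_JaxBackend._setup_jax_distributed_environment`: the helper name `override_megascale_env` and its parameters are hypothetical, and it covers only the three variables named in this description.

```python
from typing import Dict, Mapping

# The three variables named in the description above.
MEGASCALE_KEYS = (
    "MEGASCALE_COORDINATOR_ADDRESS",
    "MEGASCALE_NUM_SLICES",
    "MEGASCALE_SLICE_ID",
)


def override_megascale_env(
    env: Mapping[str, str],
    coordinator_address: str,
    num_slices: int,
    slice_id: int,
) -> Dict[str, str]:
    """Return a copy of `env` with the MEGASCALE_* keys unconditionally replaced.

    Hypothetical helper: values come from the caller's view of the live
    worker group, never from whatever is already present in `env`.
    """
    if not 0 <= slice_id < num_slices:
        raise ValueError(f"slice_id {slice_id} out of range for {num_slices} slices")
    updated = dict(env)  # leave the caller's mapping untouched
    updated.update(
        {
            "MEGASCALE_COORDINATOR_ADDRESS": coordinator_address,
            "MEGASCALE_NUM_SLICES": str(num_slices),
            "MEGASCALE_SLICE_ID": str(slice_id),
        }
    )
    return updated


# A replacement pod stamped with a stale 3-slice view of a 2-slice job.
stale_env = {
    "MEGASCALE_COORDINATOR_ADDRESS": "10.0.0.9",
    "MEGASCALE_NUM_SLICES": "3",
    "MEGASCALE_SLICE_ID": "2",
    "UNRELATED": "kept",
}
fresh_env = override_megascale_env(stale_env, "10.0.0.1", num_slices=2, slice_id=1)
```

The point of writing unconditionally, rather than only filling in missing keys, is that a stale value already present on the pod can never win over the value derived from the live worker group.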

TPU SlicePlacementGroup reservation failures are not retried
The CPU/GPU and TPU paths in WorkerGroup._create_placement_group handle "placement group can't be satisfied right now" inconsistently:

| Path | Timeout outcome |
|---|---|
| CPU/GPU | `pg_handle.wait(timeout)` raises `WorkerGroupStartupTimeoutError` → retryable; controller goes `SCHEDULING -> RESCHEDULING` |
| TPU (before this PR) | `SlicePlacementGroup(...)` blocks synchronously in its constructor; on timeout the catch-all wraps it in a plain `ValueError` → not retryable → run errors out |

So if the autoscaler is still bringing up a TPU slice when the 100s head reservation deadline elapses, the run fails immediately instead of retrying. This PR therefore translates TimeoutError from the TPU head reservation into the standard WorkerGroupStartupTimeoutError so that it flows through the existing retry machinery, and translates other unexpected exceptions into WorkerGroupStartupFailedError, matching the precedent set by the worker-actor startup path (RayActorError -> WorkerGroupStartupFailedError).
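A minimal sketch of that exception translation follows. The two exception classes are local stand-ins that only borrow the names used in this description, and `reserve_tpu_slice` and `reserve_fn` are hypothetical; the real logic lives in `WorkerGroup._create_placement_group` and may differ.

```python
from typing import Any, Callable


class WorkerGroupStartupTimeoutError(Exception):
    """Local stand-in for the retryable startup-timeout error."""


class WorkerGroupStartupFailedError(Exception):
    """Local stand-in for the startup-failure error."""


def reserve_tpu_slice(reserve_fn: Callable[[float], Any], timeout_s: float) -> Any:
    """Call a blocking reservation function and normalize its failures.

    `reserve_fn` is a hypothetical callable standing in for the blocking
    slice reservation; it is not a Ray API.
    """
    try:
        return reserve_fn(timeout_s)
    except TimeoutError as exc:
        # Capacity may still be on its way: surface the retryable error type.
        raise WorkerGroupStartupTimeoutError(
            f"TPU slice reservation timed out after {timeout_s}s"
        ) from exc
    except Exception as exc:
        # Anything else is reported as a startup failure, not a bare ValueError.
        raise WorkerGroupStartupFailedError(
            "TPU slice reservation failed unexpectedly"
        ) from exc
```

Chaining with `from exc` keeps the original exception available as `__cause__`, so the provider-level error is still visible in the traceback after translation.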

Test

Tested on Anyscale platform, see example logs: https://gist.github.com/liulehui/0bfb32d1db4d317e1694290fe1290850

gemini-code-assist

Code Review

This pull request introduces a configurable timeout for TPU slice reservations and ensures that JAX multi-slice environment variables are authoritatively overridden to prevent initialization hangs. It also includes a new test case to verify the environment variable overrides. Feedback was provided to move an inline import to the top level for better PEP 8 compliance.

Signed-off-by: Lehui Liu <lehui@anyscale.com>

ryanaoleary · 2026-04-28T19:30:12Z

            env_vars = {}
            if num_slices > 1:
                slice_id = min(i // workers_per_slice, num_slices - 1)
                env_vars = get_tpu_coordinator_env_vars(


We currently don't pass the master port through, so it always defaults to 8081 now. Previously it would remain whatever value the user/TPU webhook set. Should this now be:

    env_vars = get_tpu_coordinator_env_vars(
        coordinator_address=master_addr,
        num_slices=num_slices,
        slice_id=slice_id,
        coordinator_port=str(master_port),  # use the dynamic value from controller
    )

Removed from the list. I think we will need a different port value, given that this port is used by multislice DCN while the master port is for the JAX distributed coordinator.
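The slice assignment in the snippet under review, `min(i // workers_per_slice, num_slices - 1)`, can be checked with a small worked example. The function name `assign_slice_ids` is hypothetical, and computing `workers_per_slice` as `num_workers // num_slices` is an assumption; the snippet does not show how the PR derives it.

```python
from typing import List


def assign_slice_ids(num_workers: int, num_slices: int) -> List[int]:
    """Map each worker rank to a slice id using the expression from the review snippet."""
    if num_slices < 1 or num_workers < num_slices:
        raise ValueError("need at least one worker per slice")
    # Assumption: workers are split evenly, with any remainder going to the last slice.
    workers_per_slice = num_workers // num_slices
    return [
        # The min() clamp keeps remainder workers inside the last valid slice id.
        min(i // workers_per_slice, num_slices - 1)
        for i in range(num_workers)
    ]


even_split = assign_slice_ids(num_workers=8, num_slices=2)
uneven_split = assign_slice_ids(num_workers=5, num_slices=2)
```

With an even split the clamp never triggers. With 5 workers over 2 slices, rank 4 would otherwise compute slice id 2, which does not exist, so the clamp folds it into slice 1.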

ryanaoleary · 2026-04-28T19:31:24Z

Left a couple comments related to how we handle MEGASCALE_PORT. I think using the coordinator calculated value should be fine since I believe it will guarantee a unique port. Everything else in the PR LGTM so will approve once the port comments are addressed.

Signed-off-by: Lehui Liu <lehui@anyscale.com>

ryanaoleary

LGTM!!

Signed-off-by: Lehui Liu <lehui@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6387910.

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui changed the title ~~[train][jax] Jaxtrainer~~ Apr 23, 2026

liulehui changed the title ~~[train][jax] Jaxtrainer multislice ft~~ Apr 23, 2026

liulehui added the go add ONLY when ready to merge, run all tests label Apr 23, 2026

gemini-code-assist Bot reviewed Apr 23, 2026

Comment thread python/ray/_private/accelerators/tpu.py Outdated

liulehui added 3 commits April 27, 2026 14:10

address multislice reservation

6974c84

Signed-off-by: Lehui Liu <lehui@anyscale.com>

make megascale env vars configurable

d0f1ded

Signed-off-by: Lehui Liu <lehui@anyscale.com>

tpu changes

71d03b2

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui force-pushed the jaxtrainer branch from 88d428d to 71d03b2 Compare April 27, 2026 21:51

liulehui changed the title ~~[train][jax] lehui/Jaxtrainer multislice ft~~ Apr 27, 2026

fix comment

1350500

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui marked this pull request as ready for review April 27, 2026 22:06

liulehui requested review from a team as code owners April 27, 2026 22:06

liulehui requested a review from ryanaoleary April 27, 2026 22:07

fix doc

8971542

Signed-off-by: Lehui Liu <lehui@anyscale.com>

ray-gardener Bot added the train Ray Train Related Issue label Apr 28, 2026

ryanaoleary reviewed Apr 28, 2026

Comment thread python/ray/train/v2/tests/test_jax_trainer.py

ryanaoleary reviewed Apr 28, 2026

Comment thread python/ray/train/v2/jax/config.py Outdated

ryanaoleary reviewed Apr 28, 2026

remove megascale port from jax multislice env vars

1ac8861

Signed-off-by: Lehui Liu <lehui@anyscale.com>

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/train/v2/jax/config.py

ryanaoleary approved these changes Apr 28, 2026

matthewdeng approved these changes Apr 29, 2026

fix flaky java test

6387910

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui force-pushed the jaxtrainer branch from ead62a0 to 6387910 Compare May 4, 2026 18:20

liulehui requested review from SongGuyang, WangTaoTheTonic, kfstorm and raulchen as code owners May 4, 2026 18:20

cursor Bot reviewed May 4, 2026

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker_group.py

liulehui added 3 commits May 4, 2026 12:54

fix flaky java test

4e25f1b

Signed-off-by: Lehui Liu <lehui@anyscale.com>

fix flaky java test

f55832c

Signed-off-by: Lehui Liu <lehui@anyscale.com>

Merge branch 'master' into jaxtrainer

45d6341

edoakes approved these changes May 5, 2026

matthewdeng merged commit 2094b3e into ray-project:master May 5, 2026
6 checks passed

matthewdeng merged 10 commits into ray-project:master from liulehui:jaxtrainer
