
[Train] On train initialization block until create_or_update_train_run is complete#63432

Merged
matthewdeng merged 3 commits into
ray-project:masterfrom
pseudo-rnd-thoughts:block-train-run-initialization
May 22, 2026

Conversation


@pseudo-rnd-thoughts pseudo-rnd-thoughts commented May 18, 2026

Member

Description

In Ray Train, we produce ExportTrainRun, which is a prerequisite for each ExportTrainRunAttempt. However, due to a race condition, ExportTrainRun might not be recorded before the user code fails, or ExportTrainRunAttempt might be flushed before ExportTrainRun.

Therefore, this PR adds a block to ensure that ExportTrainRun is flushed before we move on to the next stage. This should avoid problems downstream on Anyscale.
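
The ordering problem described above can be illustrated without Ray. The following is a minimal sketch, assuming a thread pool as a stand-in for asynchronous actor calls; `FakeStateStore` and `export_events` are hypothetical names for illustration and are not part of Ray Train.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeStateStore:
    """Hypothetical stand-in for a remote state actor; records events in arrival order."""

    def __init__(self):
        self.events = []
        self._lock = threading.Lock()

    def record(self, name, delay=0.0):
        time.sleep(delay)  # simulate a slow remote call
        with self._lock:
            self.events.append(name)


def export_events(block_on_run):
    """Submit a slow run export followed by a fast attempt export."""
    store = FakeStateStore()
    with ThreadPoolExecutor(max_workers=2) as pool:
        run_future = pool.submit(store.record, "ExportTrainRun", 0.2)
        if block_on_run:
            # Wait for the run record before anything else is submitted.
            run_future.result()
        attempt_future = pool.submit(store.record, "ExportTrainRunAttempt")
        attempt_future.result()
        run_future.result()
    return store.events


# With blocking, the run record always arrives first.
print(export_events(block_on_run=True))
# Without blocking, the attempt record can overtake the slower run record.
print(export_events(block_on_run=False))
```

Blocking trades a short wait during initialization for a guaranteed ordering of the two records.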

Signed-off-by: Mark Towers <mark@anyscale.com>
@pseudo-rnd-thoughts pseudo-rnd-thoughts requested a review from a team as a code owner May 18, 2026 14:03
@pseudo-rnd-thoughts pseudo-rnd-thoughts added the train Ray Train Related Issue label May 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request modifies state_manager.py to block during the INITIALIZING status of a training run, ensuring the run is recorded before any attempts occur to prevent race conditions. A minor whitespace adjustment was also made. I have no feedback to provide.

@pseudo-rnd-thoughts pseudo-rnd-thoughts added the go add ONLY when ready to merge, run all tests label May 18, 2026
- if run.status.is_terminal():
+ # Block on INITIALIZING to ensure ExportTrainRun is recorded before any ExportTrainRunAttempt.
+ # Block on terminal status so the final state isn't lost if the controller exits right after.
+ if run.status == RunStatus.INITIALIZING or run.status.is_terminal():

Contributor

I think we should move this upstream to the callsite, e.g. create_train_run. Gets a little convoluted if we check specific statuses here.

@pseudo-rnd-thoughts pseudo-rnd-thoughts May 19, 2026

Member Author

Agreed, I've added a "block" argument so that callers must explicitly specify that they want to block on the result with ray.get.
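
A rough sketch of what an explicit "block" argument might look like follows. It assumes a single-worker thread pool in place of the state actor and `Future.result()` in place of `ray.get`. The class name, the private helper, `update_running`, `update_finished`, and the `RUNNING` and `FINISHED` status strings are hypothetical; only `create_train_run` and `INITIALIZING` are mentioned in this thread, and the merged code may differ.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class StateManagerSketch:
    """Hypothetical illustration of an opt-in `block` flag; not the Ray Train class."""

    def __init__(self):
        # A single worker keeps updates in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.recorded = []

    def _create_or_update(self, status, block=False) -> Future:
        future = self._pool.submit(self.recorded.append, status)
        if block:
            # Stand-in for blocking on the remote call's result.
            future.result()
        return future

    def create_train_run(self) -> Future:
        # The call site decides to block, so the run is recorded before any attempt.
        return self._create_or_update("INITIALIZING", block=True)

    def update_running(self) -> Future:
        # Intermediate updates stay fire-and-forget.
        return self._create_or_update("RUNNING")

    def update_finished(self) -> Future:
        # Terminal updates block so the final state is not lost on exit.
        return self._create_or_update("FINISHED", block=True)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Keeping the decision at the call site means the shared helper does not need to inspect specific statuses, which is the concern raised in the review comment above.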

Signed-off-by: Mark Towers <mark@anyscale.com>

@matthewdeng matthewdeng left a comment

Contributor

Code looks good, could you add unit test?

Signed-off-by: Mark Towers <mark@anyscale.com>
@pseudo-rnd-thoughts

Member Author

could you add unit test?

Yes, I've added tests for both initialisation and termination using a state actor that gates until the training run is kicked off.
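
The tests added in this PR are not shown in the conversation. As a loose illustration of the gating idea only, the sketch below uses a `threading.Event` as the gate and a thread pool in place of a Ray actor; `GatedStore`, `initialize`, and `check_initialization_blocks` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class GatedStore:
    """Hypothetical state store whose update call waits until the gate is opened."""

    def __init__(self):
        self.gate = threading.Event()
        self.recorded = []

    def create_or_update(self, status):
        self.gate.wait()  # hold the call until the test releases it
        self.recorded.append(status)


def initialize(store, pool):
    # Blocking initialization: returns only after the store has recorded the run.
    pool.submit(store.create_or_update, "INITIALIZING").result()


def check_initialization_blocks():
    store = GatedStore()
    returned = threading.Event()

    with ThreadPoolExecutor(max_workers=2) as pool:
        def caller():
            initialize(store, pool)
            returned.set()

        thread = threading.Thread(target=caller)
        thread.start()
        # While the gate is closed, initialization must not have returned.
        blocked = not returned.wait(timeout=0.2)
        store.gate.set()
        thread.join(timeout=5)
        finished = returned.is_set()
    return blocked, finished, store.recorded
```

The same gate can be reused for the termination case by issuing a terminal update instead of the initial one.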

@matthewdeng matthewdeng merged commit d759784 into ray-project:master May 22, 2026
6 checks passed

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

2 participants