[Train] On train initialization block until create_or_update_train_run is complete#63432
Conversation
Signed-off-by: Mark Towers <mark@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request modifies state_manager.py to block during the INITIALIZING status of a training run, ensuring the run is recorded before any attempts occur to prevent race conditions. A minor whitespace adjustment was also made. I have no feedback to provide.
| if run.status.is_terminal(): | ||
| # Block on INITIALIZING to ensure ExportTrainRun is recorded before any ExportTrainRunAttempt | ||
| # Block on terminal status so the final state isn't lost if the controller exits right after. | ||
| if run.status == RunStatus.INITIALIZING or run.status.is_terminal(): |
There was a problem hiding this comment.
I think we should move this upstream to the callsite, e.g. create_train_run. Gets a little convoluted if we check specific statuses here.
There was a problem hiding this comment.
Agreed, I've added a "block" argument such that callers need to explicitly specify they want to result ray.get blocked
Signed-off-by: Mark Towers <mark@anyscale.com>
matthewdeng
left a comment
There was a problem hiding this comment.
Code looks good, could you add unit test?
Yes, I've added a tests for both initiailisation and termination using a state actor that gates until the training run is kicked off |
Description
In Ray Train, we produce
ExportTrainRunwhich is a prerequisite for eachExportTrainRunAttempthowever its possible that due a race condition thenExportTrainRunmight not run before the user code fails or theExportTrainRunAttemptis flushed beforeExportTrainRun.Therefore, this PR adds a block to ensure that
ExportTrainRunis flushed before we move onto the next stage. This should avoid problems downstream on Anyscale.