[tune] Component notification on node failure + Tests#3414

Merged
richardliaw merged 14 commits into
ray-project:master from
richardliaw:tune_cluster-2a
Dec 4, 2018

Conversation

@richardliaw

@richardliaw richardliaw commented Nov 27, 2018

Contributor

Changes include:

  • Notify Components on Requeue
  • Slight refactoring of Node Failure handling
  • Better tests

This is a subset of changes of #3309, so this should go in before.

TODO:

  • Add one more test for try_recover
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9623/
Test FAILed.

from ray.tune.suggest import BasicVariantGenerator


def register_test_trainable():

Contributor Author

Removed in favor of __fake

node = nodes.pop()
cluster.remove_node(node)
assert cluster.wait_for_nodes()
assert ray.global_state.cluster_resources()["CPU"] == 1

Contributor Author

This test previously didn't test Tune's resource tracking - updated test

    trial_executor.start_trial(trial)
except Exception as e:
    self.assertIn("a class", str(e))

Contributor Author

This test didn't actually work because start_trial didn't throw; I rewrote this test and moved it to ray_trial_executor.py.
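
For context, a minimal sketch of why a test shaped like the snippet above can pass without checking anything: if the call never raises, the `except` branch (and the assertion inside it) never runs. The two `start_trial` stand-ins and the error message below are hypothetical, written only for illustration, and are not Tune's real executor API. `assertRaises` is the standard `unittest` way to make the expectation explicit.

```python
import unittest


def start_trial(trial):
    """Hypothetical stand-in: swallows the problem instead of raising."""
    return None


def strict_start_trial(trial):
    """Hypothetical stand-in: raises when the trainable is invalid."""
    raise TypeError("trainable must be a class")


class SilentPassExample(unittest.TestCase):
    def test_vacuous(self):
        # Passes even though nothing is verified: start_trial does not
        # raise, so the assertion in the except branch is never reached.
        try:
            start_trial("trial")
        except Exception as e:
            self.assertIn("a class", str(e))

    def test_explicit(self):
        # assertRaises fails the test if no exception is raised at all.
        with self.assertRaises(TypeError) as ctx:
            strict_start_trial("trial")
        self.assertIn("a class", str(ctx.exception))
```

Swapping `strict_start_trial` for `start_trial` in `test_explicit` would make that test fail, which is the behavior the original test was missing.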

    self.start_trial(trial)
else:
    trial.status = Trial.PENDING

Contributor Author

Moved to trial_runner.try_recover for better handling and the ability to notify other components.
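
A rough sketch of the idea, under stated assumptions: the runner owns recovery, so when a restart fails it can requeue the trial as PENDING and tell every interested component in one place. The `Runner`, `RecordingComponent`, `restore_fn`, and `on_trial_requeued` names are hypothetical and only illustrate the pattern; they are not the actual code in this PR.

```python
class Trial:
    PENDING = "PENDING"
    RUNNING = "RUNNING"

    def __init__(self, trial_id):
        self.trial_id = trial_id
        self.status = Trial.RUNNING


class RecordingComponent:
    """Hypothetical listener, standing in for a scheduler or search algorithm."""

    def __init__(self):
        self.events = []

    def on_trial_requeued(self, trial):
        self.events.append(("requeued", trial.trial_id))


class Runner:
    """Hypothetical runner that centralizes recovery and notification."""

    def __init__(self, components, restore_fn):
        self.components = components
        self.restore_fn = restore_fn  # assumed to raise if the restart fails

    def try_recover(self, trial):
        try:
            self.restore_fn(trial)
            trial.status = Trial.RUNNING
        except Exception:
            # Restart failed (for example, the node is gone): requeue the
            # trial and notify every component so their state stays in sync.
            trial.status = Trial.PENDING
            for component in self.components:
                component.on_trial_requeued(trial)


def failing_restore(trial):
    raise RuntimeError("node lost")


component = RecordingComponent()
runner = Runner([component], failing_restore)
trial = Trial("t1")
runner.try_recover(trial)
```

Keeping the status change and the notifications in the runner, rather than in the executor, means the executor does not need to know which components exist.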

@richardliaw richardliaw changed the title [tune] Refactor Node FT + Tests Nov 27, 2018
@richardliaw richardliaw requested a review from ericl November 27, 2018 05:10
@richardliaw richardliaw changed the title [tune] Node FT for components + Tests Nov 27, 2018
@richardliaw richardliaw mentioned this pull request Nov 27, 2018
2 tasks
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9626/
Test FAILed.

ericl
ericl previously requested changes Nov 29, 2018
Comment thread python/ray/tune/test/ray_trial_executor_test.py Outdated
Comment thread python/ray/tune/trial_executor.py Outdated
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9671/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9670/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9672/
Test FAILed.

@richardliaw richardliaw merged commit 9d0bd50 into ray-project:master Dec 4, 2018

3 participants