Skip to content

[autoscaler] Add an initial_workers option#3530

Merged
richardliaw merged 3 commits into
ray-project:masterfrom
mattearllongshot:me/initial-workers
Jan 6, 2019
Merged

[autoscaler] Add an initial_workers option#3530
richardliaw merged 3 commits into
ray-project:masterfrom
mattearllongshot:me/initial-workers

Conversation

@mattearllongshot

Copy link
Copy Markdown
Contributor

What do these changes do?

This option goes along with `min_workers`, and `max_workers`.  When the
cluster is first brought up (or when it is refreshed with a subsequent
`ray up`) this number of nodes will be started.

It's a workaround for issues of scaling (see related issues) where it
can take a long time (or forever in the case where the head node has
`--num-cpus 0`) to scale up a cluster in response to increasing demand.

Related issue number

Workaround for #3339 and #2106

This option goes along with `min_workers`, and `max_workers`.  When the
cluster is first brought up (or when it is refreshed with a subsequent
`ray up`) this number of nodes will be started.

It's a workaround for issues of scaling (see related issues) where it
can take a long time (or forever in the case where the head node has
`--num-cpus 0`) to scale up a cluster in response to increasing demand.
@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10018/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10016/
Test FAILed.

@richardliaw richardliaw changed the title Add an initial_workers option Dec 14, 2018
@ericl

ericl commented Dec 14, 2018

Copy link
Copy Markdown
Contributor

The implementation looks good. Could we add a unit test?

Comment thread python/ray/autoscaler/autoscaler.py Outdated
cur_used = self.load_metrics.approx_workers_used()
ideal_num_nodes = int(np.ceil(cur_used / float(target_frac)))
ideal_num_workers = ideal_num_nodes - 1 # subtract 1 for head node
initial_workers = self.config.get("initial_workers", 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right way for default values is to add them to example-full.yaml; this will automatically populate configs with missing values.

@ls-daniel

Copy link
Copy Markdown
Contributor

I've added initial_workers to default-full.yaml here and written a unit test.
I can't commit to this pull request so I'll hopefully be able to work that out with @mattearllongshot over the next few days.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10552/
Test FAILed.

@mattearllongshot

Copy link
Copy Markdown
Contributor Author

Is everyone happy to merge this now?

Comment thread test/autoscaler_test.py Outdated
update_interval_s=0)
self.waitForNodes(0)
autoscaler.update()
self.waitForNodes(5) # expected due to batch sizes and concurrency

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.waitForNodes(5) # expected due to batch sizes and concurrency
self.waitForNodes(5) # expected due to batch sizes and concurrency

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(lint)

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10588/
Test FAILed.

@richardliaw

Copy link
Copy Markdown
Contributor

jenkins retest this please

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10611/
Test PASSed.

@richardliaw richardliaw merged commit 681e8cd into ray-project:master Jan 6, 2019
@richardliaw

Copy link
Copy Markdown
Contributor

@mattearllongshot, @ls-daniel, thanks for contributing this; sorry for the delay in merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

5 participants