[autoscaler] Add an initial_workers option #3530
This option complements `min_workers` and `max_workers`. When the cluster is first brought up (or refreshed with a subsequent `ray up`), this number of nodes is started. It is a workaround for scaling issues (see related issues) where scaling up a cluster in response to increasing demand can take a long time, or never happen at all when the head node has `--num-cpus 0`.
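As a rough illustration of the behaviour described above, the sketch below shows one way the three options could combine into a target worker count. It is not the code in this PR: `target_num_workers` and its `bringup` flag are hypothetical names, and the config is written as a plain Python dict rather than the cluster YAML.

```python
import math


def target_num_workers(config, cur_used, target_frac, bringup):
    """Return how many workers the autoscaler should aim for (sketch only).

    config      -- dict with min_workers, max_workers, optional initial_workers
    cur_used    -- approximate number of nodes currently in use
    target_frac -- target utilization fraction, e.g. 0.8
    bringup     -- hypothetical flag: True on first start or after `ray up`
    """
    ideal_num_nodes = int(math.ceil(cur_used / float(target_frac)))
    ideal_num_workers = ideal_num_nodes - 1  # subtract 1 for head node

    if bringup:
        # Assumption: initial_workers acts as a floor only during bring-up.
        ideal_num_workers = max(ideal_num_workers,
                                config.get("initial_workers", 0))

    # Always stay within [min_workers, max_workers].
    return min(config["max_workers"],
               max(config["min_workers"], ideal_num_workers))


config = {"min_workers": 0, "initial_workers": 5, "max_workers": 10}

# No load yet: bring-up still asks for initial_workers nodes.
print(target_num_workers(config, cur_used=0, target_frac=0.8, bringup=True))
# Same load outside bring-up: only min_workers is guaranteed.
print(target_num_workers(config, cur_used=0, target_frac=0.8, bringup=False))
```

Under these assumptions, a head node that reports no usable CPUs would still get its first workers at bring-up, and load-based scaling takes over afterwards.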
fff8fab to 815dc49
Test PASSed.
Test FAILed.
The implementation looks good. Could we add a unit test?
cur_used = self.load_metrics.approx_workers_used()
ideal_num_nodes = int(np.ceil(cur_used / float(target_frac)))
ideal_num_workers = ideal_num_nodes - 1  # subtract 1 for head node
initial_workers = self.config.get("initial_workers", 0)
I think the right way for default values is to add them to example-full.yaml; this will automatically populate configs with missing values.
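The default-filling approach suggested in this review comment can be sketched as a dict merge. This is an illustration under assumptions, not Ray's actual loader: `fill_defaults` is a hypothetical helper, and the `DEFAULTS` dict stands in for values that would be parsed from the example YAML file.

```python
import copy

# Stand-in for values parsed from the example config file (assumed contents).
DEFAULTS = {"min_workers": 0, "initial_workers": 0, "max_workers": 2}


def fill_defaults(user_config, defaults=DEFAULTS):
    """Return a new config where keys missing from user_config get defaults."""
    merged = copy.deepcopy(defaults)
    merged.update(user_config)  # user-supplied values win over defaults
    return merged


user_config = {"max_workers": 10}
config = fill_defaults(user_config)

# With defaults filled in, code can index the key directly
# instead of calling config.get("initial_workers", 0).
print(config["initial_workers"])
```

Keeping the default in one place means the autoscaler code does not need its own fallback value for each option.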
I've added `initial_workers` to default-full.yaml here and written a unit test.
Test FAILed.
Is everyone happy to merge this now?
    update_interval_s=0)
self.waitForNodes(0)
autoscaler.update()
self.waitForNodes(5) # expected due to batch sizes and concurrency
self.waitForNodes(5) # expected due to batch sizes and concurrency
self.waitForNodes(5) # expected due to batch sizes and concurrency
Test FAILed.
jenkins retest this please
Test PASSed.
@mattearllongshot, @ls-daniel, thanks for contributing this; sorry for the delay in merging!
What do these changes do?
Related issue number
Workaround for #3339 and #2106