[rllib] Parallel-data loading and multi-gpu support for IMPALA#2766

Merged
ericl merged 51 commits into
ray-project:masterfrom
ericl:impala-multigpu
Oct 15, 2018
Conversation

@ericl

@ericl ericl commented Aug 29, 2018

Contributor

What do these changes do?

  • Add support for multi-gpu optimizer to IMPALA.
  • Add option for parallel data loading and replay.
  • These are all disabled by default.

Related issue number

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7862/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7863/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7951/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/7952/

@ericl ericl left a comment

Contributor Author

Personally, I think the multi-GPU stuff is too messy and we should move away from it. However, it's not clear what the best way to do that is; perhaps it would be simpler with TF eager, or by moving a lot of the logic into numpy.

  return Resources(
      cpu=1,
-     gpu=cf["gpu"] and cf["gpu_fraction"] or 0,
+     gpu=cf["num_gpus"] and cf["num_gpus"] * cf["gpu_fraction"] or 0,

Contributor Author

I don't know what will happen if gpu_fraction > 1, but for < 1 I believe it should work.
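
For readers unfamiliar with the `and ... or 0` idiom, here is a minimal standalone sketch of how the `gpu=` expression using `num_gpus` quoted above evaluates. The helper name `requested_gpus` and the sample config values are made up for illustration, and the sketch says nothing about how Ray's scheduler treats the resulting number.

```python
def requested_gpus(cf):
    """Evaluate the same `and ... or 0` expression as the quoted `gpu=` line.

    `cf` is a plain dict standing in for the trainer config (an assumption
    made so the example runs without Ray installed).
    """
    # If num_gpus is 0 (falsy), `and` short-circuits and `or 0` yields 0.
    # Otherwise the result is num_gpus scaled by gpu_fraction.
    return cf["num_gpus"] and cf["num_gpus"] * cf["gpu_fraction"] or 0


no_gpu = requested_gpus({"num_gpus": 0, "gpu_fraction": 1})
half = requested_gpus({"num_gpus": 2, "gpu_fraction": 0.5})
over = requested_gpus({"num_gpus": 2, "gpu_fraction": 1.5})
print(no_gpu, half, over)
```

With `gpu_fraction` below 1 the request shrinks proportionally; with a value above 1 the expression simply asks for more GPUs than `num_gpus`, which is the case the comment above is unsure about.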

@ericl

ericl commented Oct 14, 2018

Contributor Author

> Also, having public documentation for the shared configs and algorithm-specific things would be really good.

It would be good to figure this out. Sphinx doesn't seem to have a good way to do this, unless you duplicate the comments in the dict, or change it to a class or something. For now I think it's reasonable to expect users to read the code since it's at the top of the file.
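
As a rough illustration of the "change it to a class" option mentioned above, the sketch below documents each option with a `#:` comment, which Sphinx autodoc can pick up for class attributes but not for individual dict entries. This is not what the PR does: the class name `ExampleConfig` and the default values are hypothetical, and only the option names `num_gpus` and `gpu_fraction` come from this thread.

```python
from dataclasses import dataclass, asdict


@dataclass
class ExampleConfig:
    """Hypothetical config class; the defaults here are illustrative only."""

    #: Number of GPUs to request (0 means no GPU).
    num_gpus: int = 0
    #: Fraction of each GPU to reserve.
    gpu_fraction: float = 1.0


# Code that expects a plain dict config can still get one.
config = asdict(ExampleConfig(num_gpus=2))
print(config)
```

The trade-off is that every option has to be declared as a typed field, whereas a dict with ordinary comments stays lighter to edit but is invisible to the doc generator.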

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8626/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8625/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8629/

@ericl

ericl commented Oct 14, 2018

Contributor Author

I just did a quick multi-GPU run of PPO / IMPALA on Atari, and performance still looks good.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8631/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8637/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8638/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8642/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8644/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8643/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8653/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8661/

@ericl ericl merged commit 3c891c6 into ray-project:master Oct 15, 2018

3 participants