Skip to content

[xray] Implement timeline and profiling API.#2306

Merged
pcmoritz merged 25 commits into
ray-project:masterfrom
robertnishihara:timeline
Jul 5, 2018
Merged

[xray] Implement timeline and profiling API.#2306
pcmoritz merged 25 commits into
ray-project:masterfrom
robertnishihara:timeline

Conversation

@robertnishihara

@robertnishihara robertnishihara commented Jun 26, 2018

Copy link
Copy Markdown
Collaborator

This PR adds more detailed profiling information to the timeline for xray and allows you to view the timeline visualization for xray. It also improves the profiling API (for xray).

Example usage. To get a json dump of the profiling information that can be loaded in chrome://tracing, simply do

# Optionally pass in filename= to specify a file to write to.
ray.global_state.chrome_tracing_dump()

The dump will include profiling information for tasks (including the prior breakdown into deserializing arguments, executing the function, getting the results) as well as ray.put, ray.get, ray.wait, task submission, and various other little things. Profiling information will also be included from the driver, and this PR allows for the possibility of creating profiling events from any process that has a GCS client.

To add a custom span to the timeline, simply do

with ray.profile('event name'):
    # Do something.

Then this will appear in the timeline. This can be done within a task or on the driver. An example timeline is attached.

screen shot 2018-06-26 at 1 17 47 pm

Not implemented in this PR

  • Timeline integration with web UI.

TODO in this PR maybe:

  • Documentation of how to use the timeline/profiling tools.
  • Option for opening timeline in browser automatically.
  • Include task spec info in timeline (does this actually matter?).
  • Cleanups
    • Remove duplicate ProfileTableData flatbuffer definition.
    • Remove unused AddProfileEvent method.
  • Include node IP address correctly in timeline!
  • Double check that the timeline still works in legacy Ray.
  • Fix tests.
  • Check performance.

cc @alanamarzoev

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6298/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6299/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6302/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6317/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6330/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6380/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6381/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6385/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6435/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6436/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6447/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6449/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6450/
Test FAILed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6451/
Test PASSed.

@robertnishihara

Copy link
Copy Markdown
Collaborator Author

This PR decreases performance for

%time l = ray.get([f.remote() for _ in range(10000)])

from about 1.3s to about 1.6s.

I think a natural solution to this is to do more batching (in particular to flush from workers on a timer instead of after every task). However, that runs into some of the thread-safety issues brought up in #2248, so I may put that on hold for now.

@AmplabJenkins

Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6464/
Test FAILed.

@robertnishihara

Copy link
Copy Markdown
Collaborator Author

jenkins, retest this please

Comment thread python/ray/experimental/state.py Outdated
def _profile_table(self, component_id):
"""Get the profile events for a given component.

TODO(rkn): This method should support limiting the number of log events

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODOs should not be part of the doc string (user documentation)

Comment thread python/ray/experimental/state.py Outdated
web browser and load the dumped file. Make sure to enable "Flow events"
in the "View Options" menu.

TODO(rkn): Support including the task specification data in the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODOs should not be part of the docstring

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6466/
Test PASSed.

@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6467/
Test PASSed.

/// Store some profile events in the GCS.
///
/// \param conn The connection information.
/// \param event_type The type of the event.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated

Comment thread src/ray/raylet/CMakeLists.txt Outdated
COMMENT "Running flatc compiler on ${NODE_MANAGER_FBS_SRC}"
VERBATIM)

#add_custom_target(gen_node_manager_fbs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this

@pcmoritz pcmoritz merged commit b90e551 into ray-project:master Jul 5, 2018
@pcmoritz pcmoritz deleted the timeline branch July 5, 2018 06:23
@AmplabJenkins

Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/6481/
Test PASSed.

@raulchen

raulchen commented Jul 9, 2018

Copy link
Copy Markdown
Contributor

hi @robertnishihara, it looks like that worker.events is accessed from 2 threads, but it's not protected by any lock. I think this could cause problems in some cases.

For example, if events.append is called between these 2 statements, the data will be lost

ray/python/ray/worker.py

Lines 2733 to 2737 in 4ef9d15

worker.local_scheduler_client.push_profile_events(
component_type, ray.ObjectID(worker.worker_id),
worker.node_ip_address, worker.events)
worker.events = []
.

Is my understanding correct? If so, I can file a PR to fix this.

Another quick question, is there any reason why only raylet uses background thread?

@raulchen

raulchen commented Jul 9, 2018

Copy link
Copy Markdown
Contributor

Just found that the background thread is only used for drivers. Why do workers not use it?

@robertnishihara

Copy link
Copy Markdown
Collaborator Author

@raulchen you're right, I think it should be protected by a lock.

It's only used in the raylet because I don't think we should add new functionality or change the behavior for the legacy code path.

It's currently only used on the driver, but I think it actually makes sense to use in workers as well. Basically, each worker/driver could just have a background thread that calls _flush_profile_events once a second or so. Then we could remove this line

flush_profile_data()
from the raylet code path.

Comment thread python/ray/worker.py
# Note(rkn): This is run on a background thread in the driver. It uses the
# local scheduler client. This should be ok because it doesn't read from
# the local scheduler client and we have the GIL here. However, if either
# of those things changes, then we could run into issues.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing some research in depth, I believe this is still problematic. The reason is because the get_task function isn't protected by by the GIL See

.

Here's the more detailed explanation of how things can go wrong:

  1. the main thread executes the get_task function and switches out when it's in the middle of write_message. (the request is half-sent now)
  2. then the flush-profile-data thread also calls into write_message via push_profile_events, and also write data to the connection.
  3. then the server will receive a corrupted request.
@raulchen raulchen mentioned this pull request Jul 10, 2018
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants