[core][rdt] Bring your own transport docs page #60308
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a new documentation page for implementing custom tensor transports in Ray Direct Transport (RDT), which is a great addition. The new page is comprehensive and provides a good overview of the TensorTransportManager interface. I've made a few suggestions to fix typos, improve clarity in some sentences, and correct code examples to ensure they are runnable. The refactoring of the existing documentation to link to this new page is also a good improvement.
| - **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.
|
| For general RDT limitations, see :ref:`limitations <limitations>`.
There was a problem hiding this comment.
The reference :ref:`limitations <limitations>` is ambiguous because this document has its own "Limitations" section (starting at line 301). While it probably points to the general RDT limitations in direct-transport.rst, it's confusing for readers and potentially for Sphinx. To improve clarity, consider more specific link text, such as "the RDT limitations page". It's also good practice to use globally unique labels for cross-references, for example by renaming the label in the target document to something like `rdt-general-limitations`.
Signed-off-by: dayshah <dhyey2019@gmail.com>
|      |                               |                               |
|      | <---- comm metadata --------- | ---- comm metadata -------->  |
|      |                               |                               |
| 4. send_multiple_tensors             |                   5. recv_multiple_tensors
There was a problem hiding this comment.
might be good to mention that send_multiple_tensors is not called for one-sided transports and the user can just raise an exception like we do in the nixl transport class
There was a problem hiding this comment.
I added a little line right below here saying that, and said the exception thing later.
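A minimal sketch of the one-sided case discussed in this thread, assuming a transport where the receiver pulls data and the send path is never exercised. The class below is a hypothetical stand-in: it doesn't subclass Ray's `TensorTransportManager`, and the `expose` helper and the argument lists are illustrative assumptions, not the real signatures.

```python
from typing import Any, Dict, List


class OneSidedTransportSketch:
    """Hypothetical stand-in; it does not subclass Ray's TensorTransportManager."""

    def __init__(self) -> None:
        # Buffers the source has made readable, keyed by an object id.
        self._exposed: Dict[str, List[Any]] = {}

    def expose(self, obj_id: str, tensors: List[Any]) -> None:
        # Illustrative helper: the source publishes buffers for remote reads.
        self._exposed[obj_id] = list(tensors)

    def send_multiple_tensors(self, tensors: List[Any], comm_metadata: Any) -> None:
        # One-sided transport: the receiver pulls, so this is never expected
        # to run. Raising makes an accidental call fail loudly.
        raise NotImplementedError("one-sided transport does not implement send")

    def recv_multiple_tensors(self, obj_id: str) -> List[Any]:
        # The receiver reads directly from what the source exposed.
        return list(self._exposed[obj_id])


transport = OneSidedTransportSketch()
transport.expose("obj-1", [1.0, 2.0])
received = transport.recv_multiple_tensors("obj-1")
```

Raising in the send method, rather than silently returning, surfaces a misconfiguration early if the transport is ever treated as two-sided.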
| Source Actor                   Owner Process                 Destination Actor
| ============                   =============                 =================
|      |                               |                               |
| 1. Task returns tensor               |                               |
There was a problem hiding this comment.
from the user perspective though they're just returning a tensor? obj ref is kind of an internal detail
|      | ------------ tensors ---------------------------------------> |
|      |                               |                               |
|      |                      (transfer complete)                      |
|      |                               |                               |
There was a problem hiding this comment.
Might be good to distinguish what happens on both sides. Like the source actor side, where we do ray.put, tracks the lifecycle via the Ray object ref, but on the destination actor side we pop the object out of the GPU object store when we do ray.get(...) and its lifecycle is controlled solely by Python. Eh, might not really belong in this doc though
There was a problem hiding this comment.
ya i think this would make sense for an internal doc, but want this to be more user-facing, just what you need to know to implement your own transport
| - If ``True``: Ray calls ``abort_transport`` on both the source and destination actors when a send / recv error occurs, allowing your transport to clean up gracefully.
| - If ``False``: Ray kills the involved actors to prevent deadlocks when errors occur during transfer.
|
| Return ``True`` only if your transport can reliably interrupt an in-progress send or receive operation without leaving either party in a blocked state.
There was a problem hiding this comment.
i feel like blocked still makes sense here and is a bit more general?
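The two branches in the quoted bullets can be sketched as a small decision function. This is a toy model of the behavior described above, not Ray's implementation: `can_abort`, `handle_transfer_error`, and the two transport classes are hypothetical names introduced for illustration, and only `abort_transport` comes from the quoted text.

```python
from typing import List


class AbortableTransport:
    def can_abort(self) -> bool:
        # Hypothetical flag: True means an in-progress send or recv can be
        # interrupted without leaving either party blocked.
        return True


class BlockingTransport:
    def can_abort(self) -> bool:
        return False


def handle_transfer_error(transport, actors: List[str]) -> List[str]:
    """Return the actions a runtime would take after a send / recv error."""
    if transport.can_abort():
        # Graceful path: ask both sides to clean up.
        return [f"abort_transport on {actor}" for actor in actors]
    # Fallback path: kill the involved actors to avoid a deadlock.
    return [f"kill {actor}" for actor in actors]


graceful = handle_transfer_error(AbortableTransport(), ["source", "destination"])
forced = handle_transfer_error(BlockingTransport(), ["source", "destination"])
```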
| garbage_collect
| ^^^^^^^^^^^^^^^
|
| Cleans up resources when Ray decides to free the RDT object. Ray calls this only on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.
There was a problem hiding this comment.
what happens on the receive side? Probably good to mention both cases (when you specify the receive into user buffer, and when you don't)
There was a problem hiding this comment.
Added a sentence about lifetime being controlled by the user on the recv side.
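The source-side versus receive-side distinction raised here can be illustrated with a toy buffer table. This is a sketch under assumptions: `SourceBufferTable`, `hold`, and `holds` are hypothetical names, and the only behavior taken from the thread is that `garbage_collect` runs on the source while the receiver's copy lives as long as the user keeps a reference to it.

```python
from typing import Any, Dict, List


class SourceBufferTable:
    """Hypothetical source-side bookkeeping for transport resources."""

    def __init__(self) -> None:
        self._buffers: Dict[str, List[Any]] = {}

    def hold(self, obj_id: str, tensors: List[Any]) -> None:
        # Keep buffers alive while the object is still referenced.
        self._buffers[obj_id] = tensors

    def garbage_collect(self, obj_id: str) -> None:
        # Called once the object is out of scope; safe to call twice.
        self._buffers.pop(obj_id, None)

    def holds(self, obj_id: str) -> bool:
        return obj_id in self._buffers


source = SourceBufferTable()
source.hold("obj-1", [1, 2, 3])

# The receiver owns an independent copy; normal Python lifetime rules apply.
receiver_copy = [1, 2, 3]

source.garbage_collect("obj-1")
source.garbage_collect("obj-1")  # idempotent in this sketch
```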
| First, install NIXL with a plain ``pip install nixl``.
| For maximum performance, run the `install_gdrcopy.sh <https://github.com/ray-project/ray/blob/master/doc/tools/install_gdrcopy.sh>`__ script (e.g., ``install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"``). You can find available OS versions `here <https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/>`__.
|
| Note that you should also set these UCX environment variables to either let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option.
There was a problem hiding this comment.
wait what does this mean "let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option". UCX will infer whether I want to use NIXL or Cuda IPC for example?
There was a problem hiding this comment.
ya ucx "should" infer the best way to transfer
There was a problem hiding this comment.
When using RDT though we don't have an option to let UCX pick the transport right?
There was a problem hiding this comment.
why don't we, by default on benchmarks where we want to use nvlink we let ucx pick and it seems to do the right thing
| The :class:`TensorTransportManager <ray.experimental.TensorTransportManager>` abstract class defines the interface for custom tensor transports. You must implement all abstract methods.
|
| The following diagram shows when each method is called during a tensor transfer:
There was a problem hiding this comment.
Great diagram! Could you add one for ray.put as well?
There was a problem hiding this comment.
+1, though in the future these diagrams should prob be moved to the general rdt internal doc 🔜
There was a problem hiding this comment.
added, i kind of want slightly diff diagrams for internal doc since it's gonna be more geared towards core developers vs. people who just want their own transport and kinda wanna understand rdt but not core
| Transport identification methods
| --------------------------------
|
| tensor_transport_backend
There was a problem hiding this comment.
I think these could be moved into an API ref page and the descriptions here should be docstrings on the methods.
There was a problem hiding this comment.
for it to be in the api ref page though these have to be public api's which we probably don't want them to be?
There was a problem hiding this comment.
nvm i realized that the methods show up in the public api page because tensor transport manager is public, gonna update this
| # the device of the tensor, currently, all tensors in the list must have the same device type
| tensor_device: Optional[torch.device] = None
|
| You can extend this class to store transport-specific metadata. For example, if your send or recv needs a source address, you can store it here.
There was a problem hiding this comment.
What is a "source address"? Like an IP address?
There was a problem hiding this comment.
ya kinda, i was thinking like read from source address or read from source rank. I updated it to say source buffer metadata instead of source address.
| @dataclass
| class MyCustomTransportMetadata(TensorTransportMetadata):
|     buffer_ids: Optional[List[Any]] = None
There was a problem hiding this comment.
I think it would be more useful to add an actual example, maybe like a toy one that puts data into a named actor or something? You can also do this as a follow-up if it's too much for now.
There was a problem hiding this comment.
Are you saying like a full working toy example?
I kind of have that in the test and was thinking of linking that, but probably weird to link to a test file, so could also copy out into a doc file.
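As a rough idea of what a toy example could look like, here is a self-contained sketch that uses the extended metadata class to locate buffers on the receive side. Everything except the `MyCustomTransportMetadata` fields shown in the diff is an assumption: the base class is a stand-in for Ray's `TensorTransportMetadata` (with a string device instead of `torch.device`), and `extract_metadata`, `recv`, and the in-process `_STORE` are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TensorTransportMetadataStandIn:
    # Stand-in for Ray's base class; the real field is a torch.device.
    tensor_device: Optional[str] = None


@dataclass
class MyCustomTransportMetadata(TensorTransportMetadataStandIn):
    # Transport-specific metadata: ids of the buffers to read from.
    buffer_ids: Optional[List[Any]] = None


# Toy "shared" store; a real transport would use shared memory or a network.
_STORE: Dict[str, Any] = {}


def extract_metadata(tensors: List[Any], device: str = "cpu") -> MyCustomTransportMetadata:
    """Source side: stash each tensor and record where to find it."""
    ids = []
    for tensor in tensors:
        buffer_id = uuid.uuid4().hex
        _STORE[buffer_id] = tensor
        ids.append(buffer_id)
    return MyCustomTransportMetadata(tensor_device=device, buffer_ids=ids)


def recv(meta: MyCustomTransportMetadata) -> List[Any]:
    """Destination side: use the metadata to fetch the buffers in order."""
    return [_STORE[buffer_id] for buffer_id in meta.buffer_ids or []]


meta = extract_metadata([[1, 2], [3, 4]])
fetched = recv(meta)
```

The metadata object is the only thing that has to travel between the two sides in this sketch, which is why transport-specific fields such as `buffer_ids` belong on it.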
| garbage_collect
| ^^^^^^^^^^^^^^^
|
| Cleans up resources when Ray decides to free the RDT object. Ray calls this on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.
There was a problem hiding this comment.
Might want to specify how GC for non-source actors' copies works, so they know that they aren't GCed through this codepath.
There was a problem hiding this comment.
Ya added a sentence for what happens for the receiver side copy.
| - **Registration must happen before actor creation.** You must create your actor on the same process where you registered the custom tensor transport. For example, you can't use your custom transport with an actor if you registered the custom transport on the driver and created the actor inside a task.
|
| - **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.
There was a problem hiding this comment.
Could squash with the previous bullet?
There was a problem hiding this comment.
oh my bad, the description of 2 is 4, and the title is 3 💀. Just deleted it, and cleaned up the others.
There was a problem hiding this comment.
wait nvm i confused myself there's separate points here... Reworded a bit, the section should make more sense
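One way to picture the two separate points in these bullets is a toy model in which each process has its own registry and an actor captures a snapshot of it at creation time. This is only a mental model under that assumption, not how Ray implements registration; `ToyProcess`, `ToyActor`, and their methods are hypothetical.

```python
from typing import FrozenSet, Set


class ToyActor:
    def __init__(self, transports: FrozenSet[str]) -> None:
        # Snapshot of what the creating process had registered.
        self._transports = transports

    def can_use(self, name: str) -> bool:
        return name in self._transports


class ToyProcess:
    def __init__(self) -> None:
        self._registry: Set[str] = set()

    def register_transport(self, name: str) -> None:
        self._registry.add(name)

    def create_actor(self) -> ToyActor:
        return ToyActor(frozenset(self._registry))


driver = ToyProcess()
worker_task = ToyProcess()  # a different process, e.g. a task

driver.register_transport("my_transport")
registered_first = driver.create_actor()

# Created from a process that never registered the transport.
created_elsewhere = worker_task.create_actor()

# Created before registration happened on the same process.
late_process = ToyProcess()
created_too_early = late_process.create_actor()
late_process.register_transport("my_transport")
```

In this model only `registered_first` can use the transport, which mirrors the two bullets: register on the creating process, and register before creating the actor.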
| - **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.
|
| - **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.
There was a problem hiding this comment.
The wording here is a bit hard to understand. Also should it be OR not AND for the first sentence?
There was a problem hiding this comment.
Ya, the wording confused me too looking back at it. Reworked this section.
Signed-off-by: dayshah <dhyey2019@gmail.com>
| - **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
|
| - **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
There was a problem hiding this comment.
Duplicate limitation bullets about different processes
Low Severity
The "Out-of-order actors" bullet and the "Actor creation and task submission from different processes" bullet describe nearly identical limitations — both state that if the process submitting the actor task differs from the one that created the actor, Ray can't guarantee registration. The second bullet is a strict superset of the first. A PR reviewer already flagged this with "Could squash with the previous bullet?"
There was a problem hiding this comment.
Uh the effect is the same, but they're different causes to a user
| - **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
|
| - **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
There was a problem hiding this comment.
Duplicate limitation bullet points with identical content
Low Severity
The "Out-of-order actors" and "Actor creation and task submission from different processes" bullets describe effectively the same limitation — both state that if the task submission process differs from the actor creation process, custom transport registration can't be guaranteed. These are redundant and could be consolidated. Also, the "Out-of-order actors" bullet is missing a period/colon after the bold text, inconsistent with the other bullets.
There was a problem hiding this comment.
They're two different things from the user perspective
Signed-off-by: dayshah <dhyey2019@gmail.com>
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Signed-off-by: dayshah <dhyey2019@gmail.com>
stephanie-wang left a comment
There was a problem hiding this comment.
The example is very nice! Just some suggestions on wording.
| Ray doesn't hold the tensor after returning it to the user on the recv side, so it can be
| garbage collected whenever the user stops using it.
There was a problem hiding this comment.
I found this sentence a bit confusing / not sure what the point was - are you trying to say that the user doesn't need to do any cleanup on the recv side?
There was a problem hiding this comment.
Ya... reworded a bit
Signed-off-by: dayshah <dhyey2019@gmail.com>


Description
Adds a docs page that explains how to register custom transports.
The page walks through an implementation of a custom shared-memory transport and includes diagrams that show when each method is called.
This PR also updates the API docstrings of TensorTransportManager to be more detailed for users and contributors.
Registration of custom transports was added in #59255.