[core][rdt] Bring your own transport docs page#60308

Merged
dayshah merged 18 commits into ray-project:master from dayshah:byot-docs
Apr 14, 2026
@dayshah dayshah commented Jan 20, 2026

Contributor

Description

Adding a docs page to explain to users how to register custom transports.

The docs page walks through an implementation of a custom shared memory transport and has some diagrams that show when each method is called.

I also updated the API docstrings of the tensor transport manager to be more detailed for users and contributors.

Registration of custom transports was added in #59255.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from a team as a code owner January 20, 2026 02:46
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jan 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new documentation page for implementing custom tensor transports in Ray Direct Transport (RDT), which is a great addition. The new page is comprehensive and provides a good overview of the TensorTransportManager interface. I've made a few suggestions to fix typos, improve clarity in some sentences, and correct code examples to ensure they are runnable. The refactoring of the existing documentation to link to this new page is also a good improvement.

Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated

- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.

For general RDT limitations, see :ref:`limitations <limitations>`.

Contributor

medium

The reference :ref:`limitations <limitations>` is ambiguous because this document has its own "Limitations" section (starting at line 301). While this probably points to the general RDT limitations in direct-transport.rst, it's confusing for readers and potentially for Sphinx. To improve clarity, consider using more specific link text, for example: For general RDT limitations, see :ref:`the RDT limitations page <limitations>`. It's also good practice to use globally unique labels for cross-referencing to avoid ambiguity, e.g., by renaming the label in the target document to something like `rdt-general-limitations`.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Jan 20, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
     |                               |                              |
     | <---- comm metadata --------- | ---- comm metadata --------> |
     |                               |                              |
4. send_multiple_tensors             |                  5. recv_multiple_tensors

Contributor

might be good to mention that send_multiple_tensors is not called for one-sided transports and the user can just raise an exception like we do in the nixl transport class

Contributor Author

I added a little line right below here saying that, and said the exception thing later.
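
To illustrate the point in this thread, here is a minimal sketch of a one-sided transport. It assumes a hypothetical stand-in base class (`TransportSketch`) with simplified signatures; it is not Ray's actual `TensorTransportManager` interface. The receiver pulls data on its own, so the send hook should never run and simply raises.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TransportSketch(ABC):
    """Hypothetical stand-in for a transport interface; not Ray's real API."""

    @abstractmethod
    def send_multiple_tensors(self, tensors: List[bytes], key: str) -> None: ...

    @abstractmethod
    def recv_multiple_tensors(self, key: str) -> List[bytes]: ...


class OneSidedTransport(TransportSketch):
    """The receiver reads directly from buffers the source has exposed."""

    def __init__(self) -> None:
        # Stands in for memory the source has made remotely readable.
        self.exposed: Dict[str, List[bytes]] = {}

    def expose(self, key: str, tensors: List[bytes]) -> None:
        self.exposed[key] = tensors

    def send_multiple_tensors(self, tensors: List[bytes], key: str) -> None:
        # One-sided: the framework is not expected to call the send path.
        raise NotImplementedError("one-sided transport has no send step")

    def recv_multiple_tensors(self, key: str) -> List[bytes]:
        # The receiver pulls the data without source-side participation.
        return list(self.exposed[key])


transport = OneSidedTransport()
transport.expose("obj-1", [b"\x01\x02", b"\x03"])
received = transport.recv_multiple_tensors("obj-1")
```

Raising in the unused method, rather than silently returning, makes an unexpected call visible immediately.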

Source Actor                   Owner Process                Destination Actor
============                   =============                =================
     |                               |                              |
1. Task returns tensor               |                              |

Contributor

object ref for tensor

Contributor Author

from the user perspective though they're just returning a tensor? obj ref is kind of an internal detail

| ------------ tensors ---------------------------------------> |
|                               |                               |
|                      (transfer complete)                      |
|                               |                               |

@Sparks0219 Sparks0219 Feb 9, 2026

Contributor

Might be good to distinguish what happens on both sides. Like the source actor side, where we do ray.put, tracks the lifecycle via the Ray object ref, but on the destination actor side we pop the object out of the GPU object store when we do ray.get(...) and its lifecycle is controlled solely by Python. Eh, might not really belong in this doc though

Contributor Author

ya i think this would make sense for an internal doc, but want this to be more user-facing, just what you need to know to implement your own transport

- If ``True``: Ray calls ``abort_transport`` on both the source and destination actors when a send or recv error occurs, allowing your transport to clean up gracefully.
- If ``False``: Ray kills the involved actors to prevent deadlocks when errors occur during transfer.

Return ``True`` only if your transport can reliably interrupt an in-progress send or receive operation without leaving either party in a blocked state.

Contributor

deadlocked state?

Contributor Author

i feel like blocked still makes sense here and is a bit more general?
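
As a rough illustration of what "reliably interrupt an in-progress receive" can mean, the sketch below uses plain threads. `AbortableRecv` and its methods are hypothetical names for this example only (`abort_transport` is borrowed from the snippet above); this is not how Ray implements abort. The blocked receiver polls an abort flag, so it exits instead of staying blocked.

```python
import threading


class AbortableRecv:
    """Hypothetical receiver whose blocking wait can be interrupted."""

    def __init__(self) -> None:
        self._data_ready = threading.Event()
        self._aborted = threading.Event()
        self._payload = None

    def deliver(self, payload: bytes) -> None:
        self._payload = payload
        self._data_ready.set()

    def abort_transport(self) -> None:
        # Wake any blocked receiver so it can exit instead of hanging.
        self._aborted.set()

    def recv(self, poll_interval: float = 0.01) -> bytes:
        # Wait in short slices so an abort is noticed promptly.
        while not self._data_ready.wait(poll_interval):
            if self._aborted.is_set():
                raise RuntimeError("transfer aborted")
        return self._payload


receiver = AbortableRecv()
outcome = []


def blocked_recv() -> None:
    try:
        outcome.append(receiver.recv())
    except RuntimeError as exc:
        outcome.append(str(exc))


worker = threading.Thread(target=blocked_recv)
worker.start()
receiver.abort_transport()  # simulate an error on the sending side
worker.join(timeout=5)
```

A transport that can only wait indefinitely on the peer has no equivalent of the abort flag, which is the case where returning ``False`` would apply.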

Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
garbage_collect
^^^^^^^^^^^^^^^

Cleans up resources when Ray decides to free the RDT object. Ray calls this only on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.

Contributor

what happens on the receive side? Probably good to mention both cases (when you specify the receive into user buffer, and when you don't)

Contributor Author

Added a sentence about lifetime being controlled by the user on the recv side.
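
A minimal sketch of the source-side cleanup described in the snippet above, assuming a hypothetical bookkeeping class (`SourceSideBuffers`) keyed by an object ID string. The real `garbage_collect` signature may differ; only the idea of releasing transport resources once the object is out of scope is taken from the doc text.

```python
from typing import Dict, List


class SourceSideBuffers:
    """Hypothetical source-actor bookkeeping for transport resources."""

    def __init__(self) -> None:
        self._buffers: Dict[str, List[bytearray]] = {}

    def register(self, obj_id: str, buffers: List[bytearray]) -> None:
        # Called when the source produces an object that may be transferred.
        self._buffers[obj_id] = buffers

    def garbage_collect(self, obj_id: str) -> None:
        # Called once the object is out of scope: drop transport resources.
        # pop() with a default keeps the call idempotent.
        self._buffers.pop(obj_id, None)

    def live_objects(self) -> List[str]:
        return sorted(self._buffers)


source = SourceSideBuffers()
source.register("obj-a", [bytearray(8)])
source.register("obj-b", [bytearray(16)])
source.garbage_collect("obj-a")
source.garbage_collect("obj-a")  # second call is a harmless no-op
```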

Comment thread doc/source/ray-core/direct-transport/direct-transport.rst
First, install NIXL with a plain ``pip install nixl``.
For maximum performance, run the `install_gdrcopy.sh <https://github.com/ray-project/ray/blob/master/doc/tools/install_gdrcopy.sh>`__ script (e.g., ``install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"``). You can find available OS versions `here <https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/>`__.

Note that you should also set these UCX environment variables to either let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option.

@Sparks0219 Sparks0219 Feb 9, 2026

Contributor

wait what does this mean "let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option". UCX will infer whether I want to use NIXL or Cuda IPC for example?

Contributor Author

ya ucx "should" infer the best way to transfer

Contributor

When using RDT though we don't have an option to let UCX pick the transport right?

@dayshah dayshah Feb 12, 2026

Contributor Author

Why don't we? By default, on benchmarks where we want to use NVLink, we let UCX pick and it seems to do the right thing.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from Sparks0219 February 11, 2026 19:52

The :class:`TensorTransportManager <ray.experimental.TensorTransportManager>` abstract class defines the interface for custom tensor transports. You must implement all abstract methods.

The following diagram shows when each method is called during a tensor transfer:

Contributor

Great diagram! Could you add one for ray.put as well?

@Sparks0219 Sparks0219 Feb 12, 2026

Contributor

+1, though in the future these diagrams should prob be moved to the general rdt internal doc 🔜

Contributor Author

added, i kind of want slightly diff diagrams for internal doc since it's gonna be more geared towards core developers vs. people who just want their own transport and kinda wanna understand rdt but not core

Transport identification methods
--------------------------------

tensor_transport_backend

Contributor

I think these could be moved into an API ref page and the descriptions here should be docstrings on the methods.

Contributor Author

for it to be in the api ref page though these have to be public api's which we probably don't want them to be?

Contributor Author

nvm, I realized that the methods show up in the public API page because the tensor transport manager is public, gonna update this

# the device of the tensor, currently, all tensors in the list must have the same device type
tensor_device: Optional[torch.device] = None

You can extend this class to store transport-specific metadata. For example, if your send or recv needs a source address, you can store it here.

Contributor

What is a "source address"? Like an IP address?

Contributor Author

ya kinda, i was thinking like read from source address or read from source rank. I updated it to say source buffer metadata instead of source address.


@dataclass
class MyCustomTransportMetadata(TensorTransportMetadata):
    buffer_ids: Optional[List[Any]] = None

Contributor

I think it would be more useful to add an actual example, maybe like a toy one that puts data into a named actor or something? You can also do this as a follow-up if it's too much for now.

Contributor Author

Are you saying like a full working toy example?
I kind of have that in the test and was thinking of linking that, but probably weird to link to a test file, so could also copy out into a doc file.
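
For readers wondering what a toy example along these lines might look like, here is a single-process sketch of the shared-memory idea using only the standard library. It is not the example from the test file or the docs page; `ShmMetadata`, `publish`, `fetch`, and `release` are hypothetical names, and raw bytes stand in for tensors.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory
from typing import List


@dataclass
class ShmMetadata:
    """Hypothetical metadata: where the receiver can find each buffer."""
    names: List[str]
    sizes: List[int]


def publish(buffers: List[bytes]) -> tuple:
    """Source side: copy each buffer into its own shared memory segment."""
    segments = []
    for buf in buffers:
        # SharedMemory rejects size 0, so allocate at least one byte.
        seg = shared_memory.SharedMemory(create=True, size=max(len(buf), 1))
        seg.buf[: len(buf)] = buf
        segments.append(seg)
    meta = ShmMetadata([s.name for s in segments], [len(b) for b in buffers])
    return segments, meta


def fetch(meta: ShmMetadata) -> List[bytes]:
    """Destination side: attach by name and copy the bytes out."""
    out = []
    for name, size in zip(meta.names, meta.sizes):
        seg = shared_memory.SharedMemory(name=name)
        try:
            out.append(bytes(seg.buf[:size]))
        finally:
            seg.close()
    return out


def release(segments) -> None:
    """Source side cleanup, analogous to a garbage-collect hook."""
    for seg in segments:
        seg.close()
        seg.unlink()


segments, meta = publish([b"hello", b"rdt"])
fetched = fetch(meta)
release(segments)
```

Only the small metadata object needs to travel between the two sides; the payload itself stays in shared memory until the source releases it.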

garbage_collect
^^^^^^^^^^^^^^^

Cleans up resources when Ray decides to free the RDT object. Ray calls this on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.

Contributor

Might want to specify how GC for non-source actors' copies works, so they know that they aren't GCed through this codepath.

Contributor Author

Ya added a sentence for what happens for the receiver side copy.


- **Registration must happen before actor creation.** You must create your actor on the same process where you registered the custom tensor transport. For example, you can't use your custom transport with an actor if you registered the custom transport on the driver and created the actor inside a task.

- **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.

Contributor

Could squash with the previous bullet?

Contributor Author

oh my bad, the description of 2 is 4, and the title is 3 💀. Just deleted it, and cleaned up the others.

Contributor Author

wait nvm i confused myself there's separate points here... Reworded a bit, the section should make more sense
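
One way to picture the "registered before creation" limitation is a registry that an actor snapshots when it is created. The sketch below is a simplified model with hypothetical names (`_REGISTRY`, `register_transport`, `ActorSketch`); Ray's actual registration mechanism may work differently.

```python
from typing import Dict, FrozenSet


# Hypothetical per-process registry; Ray's real mechanism may differ.
_REGISTRY: Dict[str, type] = {}


def register_transport(name: str, manager_cls: type) -> None:
    _REGISTRY[name] = manager_cls


class ActorSketch:
    """Captures the transports known at creation time, and nothing later."""

    def __init__(self) -> None:
        self.transports: FrozenSet[str] = frozenset(_REGISTRY)

    def can_use(self, name: str) -> bool:
        return name in self.transports


class ShmManager: ...
class LateManager: ...


register_transport("shm", ShmManager)
actor = ActorSketch()                     # created after "shm" was registered
register_transport("late", LateManager)  # registered after actor creation
```

In this model, `actor` can use `"shm"` but not `"late"`, while an actor created afterwards would see both.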


- **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.

- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.

Contributor

The wording here is a bit hard to understand. Also should it be OR not AND for the first sentence?

Contributor Author

Ya, the wording confused me too looking back at it. Reworked this section.

Comment thread doc/source/ray-core/direct-transport/direct-transport.rst
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>

- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.

- **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.


Duplicate limitation bullets about different processes

Low Severity

The "Out-of-order actors" bullet and the "Actor creation and task submission from different processes" bullet describe nearly identical limitations — both state that if the process submitting the actor task differs from the one that created the actor, Ray can't guarantee registration. The second bullet is a strict superset of the first. A PR reviewer already flagged this with "Could squash with the previous bullet?"


Contributor Author

Uh the effect is the same, but they're different causes to a user

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.

- **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.


Duplicate limitation bullet points with identical content

Low Severity

The "Out-of-order actors" and "Actor creation and task submission from different processes" bullets describe effectively the same limitation — both state that if the task submission process differs from the actor creation process, custom transport registration can't be guaranteed. These are redundant and could be consolidated. Also, the "Out-of-order actors" bullet is missing a period/colon after the bold text, inconsistent with the other bullets.


Contributor Author

They're two different things from the user perspective

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@github-actions


This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 10, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 15, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>

@stephanie-wang stephanie-wang left a comment

Contributor

The example is very nice! Just some suggestions on wording.

Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated
Comment thread python/ray/experimental/rdt/tensor_transport_manager.py Outdated
Comment on lines +202 to +203
Ray doesn't hold the tensor after returning it to the user on the recv side, so it can be
garbage collected whenever the user stops using it.

Contributor

I found this sentence a bit confusing / not sure what the point was - are you trying to say that the user doesn't need to do any cleanup on the recv side?

Contributor Author

Ya... reworded a bit
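
The receive-side point in this thread can be shown with plain Python reference semantics. This is a generic sketch with hypothetical names (`ReceivedTensor`, `recv_and_return`), not Ray code: if the framework returns the object without keeping its own reference, the object is freed as soon as the user drops theirs, so no explicit cleanup hook is needed.

```python
import gc
import weakref


class ReceivedTensor:
    """Hypothetical stand-in for a tensor handed to user code."""

    def __init__(self, data: bytes) -> None:
        self.data = data


def recv_and_return(data: bytes) -> ReceivedTensor:
    # The framework builds the object and returns it without keeping
    # its own reference, so the caller holds the only one.
    return ReceivedTensor(data)


tensor = recv_and_return(b"payload")
probe = weakref.ref(tensor)   # observe the object without keeping it alive
alive_before = probe() is not None

del tensor                    # user drops the last reference
gc.collect()                  # not required on CPython; makes the intent explicit
alive_after = probe() is not None
```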

Comment thread python/ray/experimental/rdt/tensor_transport_manager.py Outdated
Comment thread python/ray/experimental/rdt/tensor_transport_manager.py Outdated
Comment thread python/ray/experimental/rdt/tensor_transport_manager.py Outdated
@dayshah dayshah requested a review from stephanie-wang April 13, 2026 01:30
Signed-off-by: dayshah <dhyey2019@gmail.com>

@stephanie-wang stephanie-wang left a comment

Contributor

Looks great!

@dayshah dayshah merged commit f064c8d into ray-project:master Apr 14, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
Signed-off-by: dayshah <dhyey2019@gmail.com>

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

3 participants