[core][rdt] Bring your own transport docs page by dayshah · Pull Request #60308 · ray-project/ray

dayshah · 2026-01-20T02:46:21Z

Description

Adding a docs page to explain to users how to register custom transports.

The docs page walks through an implementation of a custom shared memory transport and has some diagrams that show when each method is called.

I also updated the API docstrings of TensorTransportManager to be more detailed for users and contributors.

Registration of custom transports was added in #59255.

Signed-off-by: dayshah <dhyey2019@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a new documentation page for implementing custom tensor transports in Ray Direct Transport (RDT), which is a great addition. The new page is comprehensive and provides a good overview of the TensorTransportManager interface. I've made a few suggestions to fix typos, improve clarity in some sentences, and correct code examples to ensure they are runnable. The refactoring of the existing documentation to link to this new page is also a good improvement.

gemini-code-assist · 2026-01-20T02:48:35Z

+
+- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.
+
+For general RDT limitations, see :ref:`limitations <limitations>`.


The reference :ref:`limitations <limitations>` is ambiguous because this document has its own "Limitations" section (starting at line 301). While this probably points to the general RDT limitations in direct-transport.rst, it's confusing for readers and potentially for Sphinx. To improve clarity, consider more specific link text, such as "the RDT limitations page" in place of "limitations". It's also good practice to use globally unique labels for cross-referencing to avoid ambiguity, e.g., by renaming the label in the target document to something like `rdt-general-limitations`.


Sparks0219 · 2026-02-09T22:03:28Z

+        |                               |                               |
+        | <---- comm metadata --------- | ---- comm metadata -------->  |
+        |                               |                               |
+   4. send_multiple_tensors             |          5. recv_multiple_tensors


might be good to mention that send_multiple_tensors is not called for one-sided transports and the user can just raise an exception like we do in the nixl transport class

I added a little line right below here saying that, and said the exception thing later.
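
For readers of this thread, the one-sided case can be sketched roughly as follows. This is an illustration only, not Ray's code: `OneSidedTransportSketch` is a hypothetical stand-in that does not subclass the real TensorTransportManager, the method signatures are assumptions, and `expose` is a made-up helper that plays the role of the source making its buffers readable.

```python
class OneSidedTransportSketch:
    """Hypothetical stand-in; does not subclass Ray's TensorTransportManager."""

    def __init__(self):
        # Maps a made-up buffer id to the data a receiver may read.
        self._buffers = {}

    def expose(self, buffer_id, data):
        # Hypothetical helper: the source makes data readable by the receiver.
        self._buffers[buffer_id] = list(data)

    def send_multiple_tensors(self, *args, **kwargs):
        # Assumed signature. In a one-sided transport the receiver pulls the
        # data, so this method is never expected to run; fail loudly if it does.
        raise NotImplementedError("one-sided transport: send is never called")

    def recv_multiple_tensors(self, buffer_id):
        # Assumed signature. The receiver reads directly from the exposed buffer.
        return list(self._buffers[buffer_id])
```

Raising instead of silently returning makes an unexpected call visible right away, which matches the suggestion in the comment above.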

Sparks0219 · 2026-02-09T22:04:38Z

+   Source Actor                    Owner Process                 Destination Actor
+   ============                    =============                 =================
+        |                               |                               |
+   1. Task returns tensor               |                               |


object ref for tensor

from the user perspective though they're just returning a tensor? obj ref is kind of an internal detail

Sparks0219 · 2026-02-09T22:09:17Z

+        | ------------ tensors ---------------------------------------> |
+        |                               |                               |
+        |                         (transfer complete)                   |
+        |                               |                               |


Might be good to distinguish what happens on both sides. On the source actor side, where we do ray.put, the lifecycle is tracked via the Ray object ref, but on the destination actor side we pop the object out of the GPU object store when we do ray.get(...) and its lifecycle is controlled solely by Python. Eh, might not really belong in this doc though.

ya i think this would make sense for an internal doc, but want this to be more user-facing, just what you need to know to implement your own transport

Sparks0219 · 2026-02-09T22:12:25Z

+- If ``True``: Ray calls `abort_transport` on both the source and destination actors when a send / recv error, allowing your transport to clean up gracefully.
+- If ``False``: Ray kills the involved actors to prevent deadlocks when errors occur during transfer.
+
+Return ``True`` only if your transport can reliably interrupt an in-progress send or receive operation without leaving either party in a blocked state.


deadlocked state?

i feel like blocked still makes sense here and is a bit more general?
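
The two branches in the quoted text can be modeled with a small sketch. It only restates the documented behavior as a decision function. The method name `can_abort_transport`, the two toy transport classes, and `plan_error_handling` are assumptions for illustration and are not taken from Ray's source.

```python
class AbortableTransport:
    """Hypothetical transport that can interrupt an in-progress send or recv."""

    def can_abort_transport(self):  # assumed method name
        return True


class BlockingTransport:
    """Hypothetical transport whose send or recv cannot be interrupted."""

    def can_abort_transport(self):  # assumed method name
        return False


def plan_error_handling(transport, source, destination):
    """Return the actions the quoted docs describe for a send / recv error."""
    if transport.can_abort_transport():
        # Both sides get a chance to clean up gracefully.
        return [("abort_transport", source), ("abort_transport", destination)]
    # Otherwise the involved actors are killed so neither side stays blocked.
    return [("kill_actor", source), ("kill_actor", destination)]
```

For example, `plan_error_handling(BlockingTransport(), "src", "dst")` yields two `kill_actor` entries, which is why returning True is only safe when neither party can be left blocked.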

Sparks0219 · 2026-02-09T22:26:58Z

+garbage_collect
+^^^^^^^^^^^^^^^
+
+Cleans up resources when Ray decides to free the RDT object. Ray calls this only on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.


what happens on the receive side? Probably good to mention both cases (when you specify the receive into user buffer, and when you don't)

Added a sentence about lifetime being controlled by the user on the recv side.

Sparks0219 · 2026-02-09T22:35:33Z

+First, install NIXL with a plain ``pip install nixl``.
+For maximum performance, run the `install_gdrcopy.sh <https://github.com/ray-project/ray/blob/master/doc/tools/install_gdrcopy.sh>`__ script (e.g., ``install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"``). You can find available OS versions `here <https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/>`__. 
+
+Note that you should also set these UCX environment variables to either let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option.


wait what does this mean "let UCX choose the right transport from all options, or so that you can yourself set your preferred transport option". UCX will infer whether I want to use NIXL or Cuda IPC for example?

ya ucx "should" infer the best way to transfer

When using RDT though we don't have an option to let UCX pick the transport right?

Why don't we? By default, on benchmarks where we want to use NVLink, we let UCX pick, and it seems to do the right thing.


stephanie-wang · 2026-02-11T22:44:28Z

+
+The :class:`TensorTransportManager <ray.experimental.TensorTransportManager>` abstract class defines the interface for custom tensor transports. You must implement all abstract methods.
+
+The following diagram shows when each method is called during a tensor transfer:


Great diagram! Could you add one for ray.put as well?

+1, though in the future these diagrams should prob be moved to the general rdt internal doc 🔜

added, i kind of want slightly diff diagrams for internal doc since it's gonna be more geared towards core developers vs. people who just want their own transport and kinda wanna understand rdt but not core

stephanie-wang · 2026-02-11T22:45:07Z

+Transport identification methods
+--------------------------------
+
+tensor_transport_backend


I think these could be moved into an API ref page and the descriptions here should be docstrings on the methods.

for it to be in the api ref page though these have to be public api's which we probably don't want them to be?

nvm, I realized that the methods show up in the public API page because TensorTransportManager is public, gonna update this

stephanie-wang · 2026-02-11T22:48:13Z

+        # the device of the tensor, currently, all tensors in the list must have the same device type
+        tensor_device: Optional[torch.device] = None
+
+You can extend this class to store transport-specific metadata. For example, if your send or recv needs a source address, you can store it here.


What is a "source address"? Like an IP address?

ya kinda, i was thinking like read from source address or read from source rank. I updated it to say source buffer metadata instead of source address.

stephanie-wang · 2026-02-11T22:49:13Z

+
+    @dataclass
+    class MyCustomTransportMetadata(TensorTransportMetadata):
+        buffer_ids: Optional[List[Any]] = None


I think it would be more useful to add an actual example, maybe like a toy one that puts data into a named actor or something? You can also do this as a follow-up if it's too much for now.

Are you saying like a full working toy example?
I kind of have that in the test and was thinking of linking that, but probably weird to link to a test file, so could also copy out into a doc file.
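
As a rough idea of what a toy example could look like, here is a sketch that uses only Python's standard `multiprocessing.shared_memory` module and no Ray APIs. Everything named here (`ShmMetadataSketch`, `publish`, `fetch`, `release`) is hypothetical; these loosely stand in for transport metadata, the source-side step that produces it, the receive step, and source-side cleanup. This is not the example from the PR's test file.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory
from typing import List, Optional


@dataclass
class ShmMetadataSketch:
    """Hypothetical metadata: which segments to read and how many bytes each holds."""
    segment_names: Optional[List[str]] = None
    sizes: Optional[List[int]] = None


def publish(payloads):
    """Source side: copy each payload into its own shared-memory segment."""
    segments, names, sizes = [], [], []
    for payload in payloads:
        # Segments cannot be zero-sized, so allocate at least one byte.
        seg = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
        seg.buf[:len(payload)] = payload
        segments.append(seg)
        names.append(seg.name)
        # Record the true length; the OS may round the segment size up.
        sizes.append(len(payload))
    return segments, ShmMetadataSketch(names, sizes)


def fetch(meta):
    """Receive side: attach by name and copy the bytes out."""
    out = []
    for name, size in zip(meta.segment_names, meta.sizes):
        seg = shared_memory.SharedMemory(name=name)
        out.append(bytes(seg.buf[:size]))
        seg.close()
    return out


def release(segments):
    """Source side: free the segments once nothing needs them anymore."""
    for seg in segments:
        seg.close()
        seg.unlink()
```

The metadata carries only names and sizes, so it stays small enough to pass between processes while the payload itself stays in shared memory. Only the source calls `release`, mirroring the source-side cleanup discussed elsewhere in this review.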

stephanie-wang · 2026-02-11T22:50:53Z

+garbage_collect
+^^^^^^^^^^^^^^^
+
+Cleans up resources when Ray decides to free the RDT object. Ray calls this on the source actor after Ray's distributed reference counting protocol determines the object is out of scope.


Might want to specify how GC for non-source actors' copies works, so they know that they aren't GCed through this codepath.

Ya added a sentence for what happens for the receiver side copy.

stephanie-wang · 2026-02-11T22:52:30Z

+
+- **Registration must happen before actor creation.** You must create your actor on the same process where you registered the custom tensor transport. For example, you can't use your custom transport with an actor if you registered the custom transport on the driver and created the actor inside a task.
+
+- **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.


Could squash with the previous bullet?

oh my bad, the description of 2 is 4, and the title is 3 💀. Just deleted it, and cleaned up the others.

wait nvm i confused myself there's separate points here... Reworded a bit, the section should make more sense
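
One way to picture the "registered before creation" bullet is the toy model below. It is purely illustrative and says nothing about how Ray implements registration: `ProcessSketch` and `ActorSketch` are hypothetical classes, and the snapshot-at-creation behavior is an assumption chosen to mirror the wording of the bullet.

```python
class ActorSketch:
    """Hypothetical actor that only knows transports captured at creation."""

    def __init__(self, transports):
        self.transports = transports

    def can_use(self, name):
        return name in self.transports


class ProcessSketch:
    """Hypothetical process holding its own set of registered transports."""

    def __init__(self):
        self.registered = set()

    def register_transport(self, name):
        self.registered.add(name)

    def create_actor(self):
        # The actor receives a frozen snapshot; later registrations are not seen.
        return ActorSketch(frozenset(self.registered))
```

In this model an actor created before `register_transport("shm")` reports `can_use("shm")` as False, while one created afterwards reports True.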

stephanie-wang · 2026-02-11T22:54:49Z

+
+- **Actors only access transports registered before their creation.** If you register a transport after creating an actor, that actor can't use the new transport.
+
+- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task submission time.


The wording here is a bit hard to understand. Also should it be OR not AND for the first sentence?

Ya, the wording confused me too looking back at it. Reworked this section.


cursor · 2026-02-12T06:45:16Z

+
+- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
+
+- **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.


Duplicate limitation bullets about different processes

Low Severity

The "Out-of-order actors" bullet and the "Actor creation and task submission from different processes" bullet describe nearly identical limitations — both state that if the process submitting the actor task differs from the one that created the actor, Ray can't guarantee registration. The second bullet is a strict superset of the first. A PR reviewer already flagged this with "Could squash with the previous bullet?"

Uh the effect is the same, but they're different causes to a user

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-12T06:56:24Z

+
+- **Out-of-order actors** If you have an out-of-order actor (such as an async actor) and the process where you submit the actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.
+
+- **Actor creation and task submission from different processes** If the process where you submit an actor task is different from where you created the actor, Ray can't guarantee it has registered your custom transport on the actor at task execution time.


Duplicate limitation bullet points with identical content

Low Severity

The "Out-of-order actors" and "Actor creation and task submission from different processes" bullets describe effectively the same limitation — both state that if the task submission process differs from the actor creation process, custom transport registration can't be guaranteed. These are redundant and could be consolidated. Also, the "Out-of-order actors" bullet is missing a period/colon after the bold text, inconsistent with the other bullets.

They're two different things from the user perspective


github-actions · 2026-03-10T12:27:21Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.


stephanie-wang

The example is very nice! Just some suggestions on wording.

stephanie-wang · 2026-03-17T18:32:59Z

+        Ray doesn't hold the tensor after returning it to the user on the recv side, so it can be
+        garbage collected whenever the user stops using it.


I found this sentence a bit confusing / not sure what the point was - are you trying to say that the user doesn't need to do any cleanup on the recv side?

Ya... reworded a bit


stephanie-wang

Looks great!


[core][rdt] Bring your own transport docs page

1439066

Signed-off-by: dayshah <dhyey2019@gmail.com>

dayshah assigned stephanie-wang and Sparks0219 Jan 20, 2026

dayshah requested a review from a team as a code owner January 20, 2026 02:46

dayshah added the "go" label (add ONLY when ready to merge, run all tests) Jan 20, 2026

gemini-code-assist Bot reviewed Jan 20, 2026


ray-gardener Bot added the "core" label (Issues that should be addressed in Ray Core) Jan 20, 2026

dayshah added 5 commits January 22, 2026 11:26

fix docs build

f278e39

Signed-off-by: dayshah <dhyey2019@gmail.com>

Merge branch 'master' into byot-docs

f983f72

Signed-off-by: dayshah <dhyey2019@gmail.com>

fix

3469784

Signed-off-by: dayshah <dhyey2019@gmail.com>

fix

4d42966

Signed-off-by: dayshah <dhyey2019@gmail.com>

Merge branch 'master' into byot-docs

4a7e51b

Signed-off-by: dayshah <dhyey2019@gmail.com>

Sparks0219 reviewed Feb 9, 2026


Comment thread doc/source/ray-core/direct-transport/custom-tensor-transport.rst Outdated

Sparks0219 reviewed Feb 9, 2026


Comment thread doc/source/ray-core/direct-transport/direct-transport.rst

Sparks0219 reviewed Feb 9, 2026


up

8466c95

Signed-off-by: dayshah <dhyey2019@gmail.com>

dayshah requested a review from Sparks0219 February 11, 2026 19:52

stephanie-wang reviewed Feb 11, 2026


dayshah added 2 commits February 11, 2026 22:32

up

518abb7

Signed-off-by: dayshah <dhyey2019@gmail.com>

fix put diagram

56b8a7d

Signed-off-by: dayshah <dhyey2019@gmail.com>

cursor Bot reviewed Feb 12, 2026


Merge branch 'master' into byot-docs

f7a1e84

dayshah requested a review from stephanie-wang February 12, 2026 06:53

cursor Bot reviewed Feb 12, 2026


Merge branch 'master' into byot-docs

f34aa58

Signed-off-by: dayshah <dhyey2019@gmail.com>

move into api docstring and restructure main doc

eed7b61

Signed-off-by: dayshah <dhyey2019@gmail.com>

github-actions Bot added the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Mar 10, 2026

Merge branch 'master' into byot-docs

d36a55e

Signed-off-by: dayshah <dhyey2019@gmail.com>

dayshah removed the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Mar 15, 2026

walk through example in docs

77ed9d1

Signed-off-by: dayshah <dhyey2019@gmail.com>

stephanie-wang reviewed Mar 17, 2026


dayshah added 3 commits March 24, 2026 13:10

comments

af467e8

Signed-off-by: dayshah <dhyey2019@gmail.com>

Merge branch 'master' into byot-docs

a588274

Merge branch 'master' into byot-docs

b4abced

dayshah requested a review from stephanie-wang April 13, 2026 01:30

address more comments

2d2dfeb

Signed-off-by: dayshah <dhyey2019@gmail.com>

stephanie-wang approved these changes Apr 14, 2026


dayshah merged commit f064c8d into ray-project:master Apr 14, 2026
6 checks passed

HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026

[core][rdt] Bring your own transport docs page (ray-project#60308)

f4b939d

Signed-off-by: dayshah <dhyey2019@gmail.com>

Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026

[core][rdt] Bring your own transport docs page (ray-project#60308)

46c5c6e

Signed-off-by: dayshah <dhyey2019@gmail.com>


