[Bug]: RayExecutorV2 multi-node DP hangs on shm_broadcast — cross-node ranks can't share single-host shared memory #43420

Summary

RayExecutorV2 (introduced via PR #36836, "[Feat][Executor] Introduce RayExecutorV2") inherits from MultiprocExecutor and uses shm_broadcast for inter-rank communication. shm_broadcast is single-host shared memory and has no native cross-node path. For multi-node DP, vLLM falls back to Gloo TCP for the cross-node traffic, which times out after 30 minutes (1800000 ms):

gloo/transport/tcp/unbound_buffer.cc:78 Timed out waiting 1800000ms for recv operation to complete

The pre-RayExecutorV2 ray_executor.py uses Ray RPC for all collective operations. Ray's RPC layer handles cross-node transport natively, with no shared memory involved.

Repro

Run any MoE model with data_parallel_size > 1 spanning multiple nodes, leaving VLLM_USE_V2_MODEL_RUNNER=1 (the default). For example, MiniMax-M2.7-AWQ-4bit with TP=4 inside each node and DP=2 across two nodes of 4× H100/GH200 each:

python -m vllm.entrypoints.openai.api_server \
  --model cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 --data-parallel-size 2 \
  --data-parallel-backend ray --data-parallel-size-local 1 \
  --data-parallel-address <head_ip> \
  --distributed-executor-backend ray \
  --trust-remote-code --enforce-eager

The job starts, both DPMoEEngineCoreActor instances are created across the two nodes, and the model loads (~11 min), but the first batch's sample_tokens RPC hangs. vllm.log accumulates:

INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
    This typically happens when some processes are hanging or doing some
    time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
    Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)

After this, the engine never recovers and trial progress drops to roughly zero. Tested with enable_expert_parallel set to both true and false: both fail in the same shm_broadcast/Gloo path (with EP off, the bottleneck shifts from the per-token MoE all-to-all to whichever subsequent collective tries to use shm).
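
To check whether a given vllm.log shows the same failure signature, the messages quoted above can be counted with a few regular expressions. This is a minimal sketch: the patterns are copied from the excerpt, while the function name `count_signatures` and the dictionary keys are labels invented here, not anything defined by vLLM.

```python
import re

# Patterns copied from the log excerpt above; the dictionary keys are
# hypothetical labels chosen for this sketch.
SIGNATURES = {
    "shm_broadcast_stall": re.compile(
        r"No available shared memory broadcast block found"
    ),
    "gloo_recv_timeout": re.compile(
        r"Timed out waiting \d+ms for recv operation to complete"
    ),
    "worker_died": re.compile(r"RayWorkerProc rank=\[\d+\] died unexpectedly"),
}


def count_signatures(lines):
    """Count how many lines match each failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines taken from the excerpt above.
sample = [
    "INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.",
    "    Timed out waiting 1800000ms for recv operation to complete",
    "ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.",
]
print(count_signatures(sample))
```

Against a real log, pass the open file object instead of `sample`, for example `with open("vllm.log") as f: print(count_signatures(f))`.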

Workaround

Setting VLLM_USE_V2_MODEL_RUNNER=0 forces the legacy ray_executor.py path. Confirmed in our environment:

V2 (VLLM_USE_V2_MODEL_RUNNER=1): job hung at ~12 min, 0 trial progress over 3+ hours, multiple shm_broadcast warnings and a final sample_tokens RPC timeout.
V1 (VLLM_USE_V2_MODEL_RUNNER=0): same model and YAML, zero shm_broadcast warnings, zero sample_tokens timeouts, zero Gloo unbound_buffer timeouts, trials flowing healthily.
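
For anyone launching from a Python wrapper, the workaround amounts to setting the environment variable before starting the server. The sketch below only builds the argument list and environment. The flags are the ones from the repro command above, while `build_launch` and the placeholder head IP are hypothetical and exist only for illustration.

```python
import os


def build_launch(head_ip, use_v2_model_runner=False):
    """Return (argv, env) for the repro command with the runner toggle set."""
    env = dict(os.environ)
    # "0" selects the legacy ray_executor.py path described in this issue.
    env["VLLM_USE_V2_MODEL_RUNNER"] = "1" if use_v2_model_runner else "0"
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "--tensor-parallel-size", "4", "--data-parallel-size", "2",
        "--data-parallel-backend", "ray", "--data-parallel-size-local", "1",
        "--data-parallel-address", head_ip,
        "--distributed-executor-backend", "ray",
        "--trust-remote-code", "--enforce-eager",
    ]
    return argv, env


argv, env = build_launch("10.0.0.1")  # placeholder address, replace with <head_ip>
print(env["VLLM_USE_V2_MODEL_RUNNER"], " ".join(argv))
```

Passing the result to `subprocess.run(argv, env=env, check=True)` on the head node would start the server with the workaround applied.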

Root cause analysis

vllm/v1/executor/ray_executor_v2.py:

```python
from vllm.v1.executor.multiproc_executor import (
    FutureWrapper,
    MultiprocExecutor,
    WorkerProc,
)

...

class RayExecutorV2(MultiprocExecutor):
    ...
```

By inheriting from MultiprocExecutor, RayExecutorV2 picks up the shm_broadcast-based inter-worker communication path. shm_broadcast uses POSIX shared memory, which is fundamentally per-host. For multi-node DP, where workers on different nodes must exchange tensors during collective ops, there is no shm path, and the fallback to Gloo TCP has the failure modes shown above.
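
The per-host limitation can be illustrated with Python's standard `multiprocessing.shared_memory` module. This is not a reproduction of vLLM's shm_broadcast code, only a sketch of the underlying property: a segment is found by name through the local kernel, so a process on another machine has nothing to attach to. The failed attach at the end stands in for what a rank on a different node would see, and assumes a POSIX system.

```python
from multiprocessing import shared_memory

# Create a named segment. On Linux it is backed by an entry under /dev/shm
# on this host only.
writer = shared_memory.SharedMemory(create=True, size=16)
segment_name = writer.name
try:
    writer.buf[:5] = b"hello"
    # A second handle attaches by name. The lookup is resolved by the local
    # kernel, so it only works for processes on the same host.
    reader = shared_memory.SharedMemory(name=segment_name)
    try:
        payload = bytes(reader.buf[:5])
    finally:
        reader.close()
finally:
    writer.close()
    writer.unlink()

# Once the name no longer exists locally (as is always the case on another
# node), attaching fails instead of reaching across the network.
try:
    shared_memory.SharedMemory(name=segment_name)
    attach_failed = False
except FileNotFoundError:
    attach_failed = True

print(payload, attach_failed)
```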

The legacy vllm/v1/executor/ray_executor.py (which doesn't inherit from MultiprocExecutor) uses Ray RPC for everything. Ray's RPC layer natively handles cross-node transport.

Environment

vLLM commit: 041cfa68e (upstream/main 2026-05-13)
PyTorch: 2.11.0+cu130
aarch64 + CUDA 13 (Jupiter GH200), but the issue is architecturally cross-node, not platform-specific
2 nodes × 4 GPUs each
Model: MiniMax-M2.7 AWQ 4-bit (MoE with 256 experts), but reproduces with any DP>1 multi-node MoE

Fix path suggestions

Short term: Document VLLM_USE_V2_MODEL_RUNNER=0 as the workaround for multi-node DP until V2 supports it.
Medium term: Either (a) stop RayExecutorV2 from using the inherited shm_broadcast path for cross-node DP, falling back to Ray RPC for inter-node collectives while keeping shm for intra-node, or (b) gate the V2 selection on a single-node check.
Longer term: Reimplement RayExecutorV2's inter-rank comm using Ray's collective groups / NCCL directly so cross-node DP works without the shm/Gloo dance.
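
The two medium-term options could be sketched roughly as follows. Every name here (`spans_multiple_nodes`, `select_executor`, `transport_for`, and the string labels) is hypothetical and does not correspond to existing vLLM code. The sketch assumes each worker's node identifier, such as its IP address, is already known at selection time.

```python
def spans_multiple_nodes(worker_node_ids):
    """True if the workers are placed on more than one distinct node."""
    return len(set(worker_node_ids)) > 1


def select_executor(worker_node_ids):
    """Option (b): only pick the V2 executor for single-node placements."""
    if spans_multiple_nodes(worker_node_ids):
        return "legacy_ray_executor"
    return "ray_executor_v2"


def transport_for(src_node_id, dst_node_id):
    """Option (a): shm within a node, Ray RPC between nodes."""
    return "shm_broadcast" if src_node_id == dst_node_id else "ray_rpc"


# Placement matching the repro: 4 workers on each of two nodes.
placement = ["node-a"] * 4 + ["node-b"] * 4
print(select_executor(placement))
print(transport_for("node-a", "node-a"), transport_for("node-a", "node-b"))
```

Option (b) is the smaller change because it only affects executor selection, whereas option (a) requires the executor to know the placement of every pair of ranks that communicate.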

AI-assisted disclosure

This issue write-up was drafted with Claude. The diagnosis and workaround were validated end-to-end against our production workload before posting; no theoretical claims are being made about code paths I did not actually trace through both branches.
