Summary
RayExecutorV2 (introduced via PR #36836, "[Feat][Executor] Introduce RayExecutorV2") inherits from MultiprocExecutor and uses shm_broadcast for inter-rank communication. shm_broadcast is single-host shared memory — it has no cross-node path natively. For multi-node DP, vLLM falls back to Gloo TCP for the cross-node bits, which times out after ~30 min:
gloo/transport/tcp/unbound_buffer.cc:78 Timed out waiting 1800000ms for recv operation to complete
The pre-RayExecutorV2 ray_executor.py uses pure Ray RPC for ALL collective operations — Ray handles cross-node natively via its own RPC layer, no shm.
Repro
Run any MoE model with data_parallel_size > 1 spanning multiple nodes, leaving VLLM_USE_V2_MODEL_RUNNER=1 (the default). E.g. MiniMax-M2.7-AWQ-4bit on 2× single-node-TP=4 (DP=2 across two 4×H100/GH200 nodes):
python -m vllm.entrypoints.openai.api_server \
--model cyankiwi/MiniMax-M2.7-AWQ-4bit \
--tensor-parallel-size 4 --data-parallel-size 2 \
--data-parallel-backend ray --data-parallel-size-local 1 \
--data-parallel-address <head_ip> \
--distributed-executor-backend ray \
--trust-remote-code --enforce-eager
The job starts, both DPMoEEngineCoreActor instances are created across both nodes, model loads (~11 min), but the first batch's sample_tokens RPC hangs. vllm.log accumulates:
INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
This typically happens when some processes are hanging or doing some
time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)
After this, the engine never recovers; trial progress drops to roughly zero per hour. Tested with enable_expert_parallel: true and false; both fail in the same shm_broadcast/Gloo path (with EP off, the bottleneck shifts from the per-token MoE all-to-all to whatever subsequent collective tries to use shm).
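To check whether a given run hit the same failure, the log signatures above can be counted mechanically. The following is a minimal sketch, not part of vLLM: the patterns are copied from the excerpt above, while the label names and the `count_signatures` helper are hypothetical.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt above; the dictionary keys are
# hypothetical labels chosen for this sketch.
SIGNATURES = {
    "shm_broadcast_stall": re.compile(
        r"No available shared memory broadcast block found"
    ),
    "gloo_recv_timeout": re.compile(
        r"Timed out waiting \d+ms for recv operation"
    ),
    "worker_died": re.compile(r"RayWorkerProc rank=\[\d+\] died unexpectedly"),
}


def count_signatures(lines):
    """Count how many lines match each failure signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing an open file handle for vllm.log as `lines` works, since file objects iterate line by line. A non-zero `shm_broadcast_stall` count followed by a `gloo_recv_timeout` matches the sequence described in this report.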
Workaround
Setting VLLM_USE_V2_MODEL_RUNNER=0 forces the legacy ray_executor.py path. Confirmed in our environment:
- V2 (VLLM_USE_V2_MODEL_RUNNER=1): job hung at ~12 min, 0 trial progress over 3+ hours, multiple shm_broadcast warnings and a final sample_tokens RPC timeout.
- V1 (VLLM_USE_V2_MODEL_RUNNER=0): same model + yaml, zero shm_broadcast warnings, zero sample_tokens timeouts, zero Gloo unbound_buffer timeouts, trials flowing healthily.
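When the server is launched from a Python wrapper, the workaround can be applied to the child process environment rather than the shell. This is a minimal sketch under the assumption stated in this report, namely that the variable selects the legacy path; the `legacy_executor_env` helper is a hypothetical name.

```python
import os


def legacy_executor_env(base_env=None):
    """Return a copy of the environment with the reported workaround applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Per the report above, "0" selects the legacy ray_executor.py path.
    env["VLLM_USE_V2_MODEL_RUNNER"] = "0"
    return env
```

The result can be passed as `env=` to `subprocess.run` together with the repro command. The input mapping is copied, so the caller's environment is left unchanged.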
Root cause analysis
vllm/v1/executor/ray_executor_v2.py:
```python
from vllm.v1.executor.multiproc_executor import (
    FutureWrapper,
    MultiprocExecutor,
    WorkerProc,
)
...
class RayExecutorV2(MultiprocExecutor):
    ...
```
By inheriting from MultiprocExecutor, RayExecutorV2 picks up the shm_broadcast-based inter-worker comm path. shm_broadcast uses POSIX shared memory which is fundamentally per-host. For multi-node DP — where workers on different nodes must exchange tensors during collective ops — there's no shm path; the fallback to Gloo TCP has the failure modes shown above.
The legacy vllm/v1/executor/ray_executor.py (which doesn't inherit from MultiprocExecutor) uses Ray RPC for everything. Ray's RPC layer natively handles cross-node transport.
Environment
- vLLM commit: 041cfa68e (upstream/main 2026-05-13)
- PyTorch: 2.11.0+cu130
- aarch64 + CUDA 13 (Jupiter GH200), but the issue is architecturally cross-node, not platform-specific
- 2 nodes × 4 GPUs each
- Model: MiniMax-M2.7 AWQ 4-bit (MoE with 256 experts), but reproduces with any DP>1 multi-node MoE
Fix path suggestions
- Short term: Document VLLM_USE_V2_MODEL_RUNNER=0 as the workaround for multi-node DP until V2 supports it.
- Medium term: Either (a) make RayExecutorV2 not inherit shm_broadcast for cross-node DP, falling back to Ray RPC for inter-node collectives while keeping shm for intra-node, or (b) gate the V2 selection on a single-node check.
- Longer term: Reimplement RayExecutorV2's inter-rank comm using Ray's collective groups / NCCL directly so cross-node DP works without the shm/Gloo dance.
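The medium-term options above both reduce to knowing which worker ranks share a host. The sketch below illustrates that decision only; it is not vLLM code, and `group_ranks_by_node`, `pick_comm_path`, and the returned path labels are hypothetical names. It assumes a rank-to-node mapping is already available, for example from Ray's placement information.

```python
from collections import defaultdict


def group_ranks_by_node(rank_to_node):
    """Map each node id to the sorted list of worker ranks placed on it."""
    groups = defaultdict(list)
    for rank, node in rank_to_node.items():
        groups[node].append(rank)
    return {node: sorted(ranks) for node, ranks in groups.items()}


def pick_comm_path(rank_to_node):
    """Hypothetical gate: shared memory only if every rank shares one node."""
    nodes = set(rank_to_node.values())
    return "shm_broadcast" if len(nodes) <= 1 else "ray_rpc"
```

For option (b), `pick_comm_path` alone is enough to refuse or reroute a multi-node launch. For option (a), the groups from `group_ranks_by_node` would mark which ranks can keep using shm among themselves while traffic between groups goes over Ray RPC.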
AI-assisted disclosure
This issue write-up was drafted with Claude. The diagnosis and workaround were validated end-to-end against our production workload before posting; no theoretical claims are being made about code paths I did not actually trace through both branches.