[Bug] Fix deepseek v4 OOM issue #44914

Merged: vllm-bot merged 3 commits into main from wentao-fix-dsv4-oom on Jun 9, 2026

@yewentao256 (Member)

Purpose

On H200, the following command:

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --max-model-len 800000 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --no-enable-flashinfer-autotune \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'

raises an out-of-memory error while the MoE expert weights are being created:

(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     self.ffn = DeepseekV4MoE(vllm_config, prefix=f"{prefix}.ffn")
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/vllm-source/vllm/models/deepseek_v4/nvidia/model.py", line 569, in __init__
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     self._init_fused_moe_experts(config, quant_config, prefix)
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/vllm-source/vllm/models/deepseek_v4/nvidia/model.py", line 634, in _init_fused_moe_experts
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     self.experts = FusedMoE(
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]                    ^^^^^^^^^
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/vllm-source/vllm/model_executor/layers/fused_moe/layer.py", line 336, in FusedMoE
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     routed_experts = routed_experts_cls(
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]                      ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/vllm-source/vllm/model_executor/layers/fused_moe/routed_experts.py", line 165, in __init__
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     self.quant_method.create_weights(layer=self, **moe_quant_params)
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/vllm-source/vllm/model_executor/layers/quantization/fp8.py", line 657, in create_weights
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     torch.empty(
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]   File "/home/yewentao256/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 116, in __torch_function__
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]     return func(*args, **kwargs)
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=1170423) ERROR 06-08 16:38:17 [multiproc_executor.py:888] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 979.00 MiB is free. Including non-PyTorch memory, this process has 138.84 GiB memory in use. Of the allocated memory 136.71 GiB is allocated by PyTorch, and 111.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
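
The figures in the error message can be checked with a short script. This is a minimal sketch that only parses the text quoted above; the helper names (`to_mib`, `parse_oom`) are illustrative and not part of vLLM or PyTorch.

```python
import re

# The relevant part of the OOM message from the traceback above.
OOM_LINE = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB. "
    "GPU 0 has a total capacity of 139.80 GiB of which 979.00 MiB is free. "
    "Including non-PyTorch memory, this process has 138.84 GiB memory in use."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a '<number> <unit>' pair from the message to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(line: str) -> dict[str, float]:
    """Extract the four memory figures (in MiB) from the OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"field {key!r} not found in message")
        stats[key] = to_mib(*match.groups())
    return stats


stats = parse_oom(OOM_LINE)
# How far the failed allocation overshoots the remaining free memory.
shortfall_mib = stats["requested"] - stats["free"]
# Share of the device already held by this process at the time of failure.
used_fraction = stats["in_use"] / stats["capacity"]
print(f"shortfall: {shortfall_mib:.0f} MiB, in use: {used_fraction:.1%}")
```

The request exceeds free memory by only 29 MiB, but the process already holds over 99% of the device during weight creation, before any KV cache is allocated, which points to oversized weights rather than fragmentation.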

PR #41184 introduced the issue: the new class it added was not accounted for in the DSV4 code path.
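
The description does not show the changed code, so the following is only a toy illustration of how a newly introduced class can slip past a type-based selection. Every name here (`ToyFusedMoE`, `ToyRoutedExperts`, `pick_scheme_before_fix`, `pick_scheme_after_fix`, `weight_bytes`) is hypothetical and none of it is vLLM's actual code. It assumes, based on the traceback ending in `fp8.py` `create_weights` and on a later commit note in this thread that mentions MXFP4 expert quantization, that the expert layer fell back to a larger weight format. The byte counts use the standard MXFP4 layout: 4-bit elements with one 8-bit shared scale per 32-element block.

```python
class ToyFusedMoE:
    """Hypothetical stand-in for the pre-refactor expert layer."""


class ToyRoutedExperts:
    """Hypothetical stand-in for a class added by a refactor."""


def pick_scheme_before_fix(layer: object) -> str:
    # Only the old class is recognized; any other layer falls through
    # to the default scheme.
    if isinstance(layer, ToyFusedMoE):
        return "mxfp4"
    return "fp8"


def pick_scheme_after_fix(layer: object) -> str:
    # The new class is now handled alongside the old one.
    if isinstance(layer, (ToyFusedMoE, ToyRoutedExperts)):
        return "mxfp4"
    return "fp8"


def weight_bytes(num_elements: int, scheme: str) -> int:
    """Approximate storage for one weight tensor under each scheme."""
    if scheme == "fp8":
        # One byte per element; scale tensors are ignored in this sketch.
        return num_elements
    if scheme == "mxfp4":
        # Half a byte per element plus one 1-byte scale per 32 elements.
        return num_elements // 2 + num_elements // 32
    raise ValueError(f"unknown scheme: {scheme}")


layer = ToyRoutedExperts()
n = 1 << 30  # arbitrary element count, chosen only for illustration
before = weight_bytes(n, pick_scheme_before_fix(layer))
after = weight_bytes(n, pick_scheme_after_fix(layer))
ratio = before / after
print(f"before: {before} B, after: {after} B, ratio: {ratio:.2f}x")
```

Under these assumptions the fallback allocates 32/17 (about 1.88 times) as many bytes for the same expert tensor.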

This PR fixes the bug. With the fix, the server starts and logs normally:

(APIServer pid=2147715) INFO 06-08 18:33:42 [loggers.py:271] Engine 000: Avg prompt throughput: 6.4 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=2147715) INFO 06-08 18:33:52 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
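
To check a log line like the ones above in a script, the metrics can be pulled into a dictionary. This is a minimal sketch based only on the line format quoted here; `parse_engine_metrics` is an illustrative helper, not a vLLM API, and the format may differ between versions.

```python
import re

# One of the engine log lines quoted above.
LOG_LINE = (
    "(APIServer pid=2147715) INFO 06-08 18:33:42 [loggers.py:271] Engine 000: "
    "Avg prompt throughput: 6.4 tokens/s, Avg generation throughput: 0.0 tokens/s, "
    "Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, "
    "Prefix cache hit rate: 0.0%"
)


def parse_engine_metrics(line: str) -> dict[str, float]:
    """Map each 'name: value unit' item after 'Engine NNN:' to a float."""
    match = re.search(r"Engine (\d+): (.*)$", line)
    if match is None:
        raise ValueError("not an engine metrics line")
    metrics = {}
    for item in match.group(2).split(", "):
        name, _, raw = item.partition(": ")
        # Keep the leading number; drop units such as 'tokens/s', 'reqs', '%'.
        number = raw.split()[0].rstrip("%")
        metrics[name] = float(number)
    return metrics


metrics = parse_engine_metrics(LOG_LINE)
print(metrics)
```

A line that parses at all shows the engine got past model loading, which is the point where the OOM occurred before the fix.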
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 requested a review from zyongye as a code owner June 8, 2026 18:42
@yewentao256 yewentao256 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 8, 2026
@mergify mergify Bot added the "deepseek" (Related to DeepSeek models) and "bug" (Something isn't working) labels Jun 8, 2026

@sfeng33 sfeng33 (Collaborator) left a comment

LGTM

@vllm-project vllm-project deleted a comment from mergify Bot Jun 9, 2026
@yewentao256 yewentao256 enabled auto-merge (squash) June 9, 2026 18:39
@vllm-bot vllm-bot merged commit d7607ad into main Jun 9, 2026
38 of 41 checks passed
@vllm-bot vllm-bot deleted the wentao-fix-dsv4-oom branch June 9, 2026 22:47
@khluu khluu added this to the v0.23.0 cherry picks milestone Jun 9, 2026
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 11, 2026
Upstream vllm-project#44914 carries the runtime fix. Keep a local regression test for the DeepSeek V4 MoE runner refactor path so RoutedExperts continues to use MXFP4 expert quantization.

Signed-off-by: jasl <jasl9187@hotmail.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 17, 2026
Upstream vllm-project#44914 carries the runtime fix. Keep a local regression test for the DeepSeek V4 MoE runner refactor path so RoutedExperts continues to use MXFP4 expert quantization.

Signed-off-by: jasl <jasl9187@hotmail.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
jasl added a commit to jasl/vllm that referenced this pull request Jun 18, 2026
Upstream vllm-project#44914 carries the runtime fix. Keep a local regression test for the DeepSeek V4 MoE runner refactor path so RoutedExperts continues to use MXFP4 expert quantization.

Signed-off-by: jasl <jasl9187@hotmail.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
Signed-off-by: yewentao256 <zhyanwentao@126.com>