Skip to content

Enable DeepSeek V4 and GLM-5.1 on SM120#43477

Merged
simon-mo merged 28 commits into
vllm-project:mainfrom
lucifer1004:dsv4-sm120-flashinfer
Jun 22, 2026
Merged

Enable DeepSeek V4 and GLM-5.1 on SM120#43477
simon-mo merged 28 commits into
vllm-project:mainfrom
lucifer1004:dsv4-sm120-flashinfer

Conversation

@lucifer1004

@lucifer1004 lucifer1004 commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

This draft PR brings up the DeepSeek V4 and GLM-5.1 SM120 path on consumer Blackwell:

  • routes DSv4 sparse MLA through the FlashInfer SM120 sparse MLA wrapper/backend
  • adds SM120 sparse MLA decode/prefill handling for sparse MLA models
  • wires DSv4-specific kernel warmups, including FlashInfer decode autotune during warmup
  • enables DeepGEMM MXFP4 MoE paths and related grouped-GEMM heuristics for SM120
  • updates CMake plumbing for local/fetched DeepGEMM and QuTLASS source trees

To use the version now, you can use the pre-built image at https://hub.docker.com/r/lucifer1004/dsv4-flash-sm120

Blocking Dependencies

This PR is draft-only until the following dependency branches are either merged upstream or replaced by equivalent released/pinned revisions:

Until those branches land upstream, this vLLM branch should be built/tested with the matching local checkouts, e.g. via DEEPGEMM_SRC_DIR for DeepGEMM and a FlashInfer install from the branch above.

Accuracy

  • NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, SM120)
  • GSM8K Pass@1: 95.0%
  • GPQA-Diamond Pass@1
    • No thinking: 73.2%
    • Max thinking: 87.4% (384K context length, 2 problems failure due to output length cap)

Performance

  • NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, SM120)
  • Random dataset
  • ISL=8000, OSL=1000

DeepSeek-V4-Flash · TP=2 · No MTP

Concurrency TTFT mean / median / P99 (ms) TPOT mean / median / P99 (ms) Output throughput (tok/s) Request throughput (req/s)
1 718 / 717 / 725 10.65 / 10.65 / 10.65 88.1 0.09
2 1066 / 1076 / 1433 12.93 / 12.92 / 13.29 143.0 0.14
4 1668 / 1769 / 2806 17.26 / 17.12 / 18.32 211.5 0.21
8 2660 / 2973 / 5298 27.82 / 27.72 / 30.63 262.6 0.26
16 3098 / 2327 / 10723 51.42 / 52.50 / 54.51 293.5 0.29
32 4609 / 2352 / 21994 75.60 / 77.74 / 80.36 398.6 0.40

DeepSeek-V4-Flash · TP=2 · MTP=1

Concurrency TTFT mean / median / P99 (ms) TPOT mean / median / P99 (ms) Output throughput (tok/s) Request throughput (req/s) Acceptance rate / length
1 738 / 739 / 741 6.82 / 6.82 / 6.89 132.4 0.13 96.75% / 1.97
2 1102 / 773 / 2147 9.30 / 9.25 / 10.08 192.3 0.19 98.86% / 1.99
4 1303 / 1121 / 2800 13.61 / 13.72 / 15.74 267.6 0.27 98.35% / 1.98
8 1841 / 1545 / 5493 25.63 / 26.01 / 30.41 287.6 0.29 97.14% / 1.97
16 2670 / 1652 / 11028 38.24 / 38.27 / 49.34 388.4 0.39 97.24% / 1.97
32 4825 / 2981 / 22246 50.99 / 51.43 / 69.78 566.3 0.57 97.45% / 1.97

DeepSeek-V4-Flash · TP=2 · MTP=2

Concurrency TTFT mean / median / P99 (ms) TPOT mean / median / P99 (ms) Output throughput (tok/s) Request throughput (req/s) Acceptance rate / length
1 734 / 734 / 736 5.58 / 5.47 / 6.06 158.5 0.16 91.48% / 2.83
2 946 / 756 / 1806 8.58 / 7.91 / 12.04 205.5 0.21 84.26% / 2.69
4 1051 / 787 / 2752 18.96 / 18.48 / 23.37 197.4 0.20 81.07% / 2.62
8 1499 / 897 / 5465 25.24 / 24.03 / 34.63 289.7 0.29 80.68% / 2.61
16 2413 / 924 / 11050 32.34 / 31.32 / 49.13 439.4 0.44 83.91% / 2.68
32 3968 / 1662 / 22397 44.76 / 44.39 / 73.33 633.3 0.63 85.59% / 2.71

GLM-5.1-NVFP4 · TP=8

Concurrency TTFT mean / median / P99 (ms) TPOT mean / median / P99 (ms) Output throughput (tok/s) Request throughput (req/s)
1 1824 / 1822 / 1838 21.07 / 21.06 / 21.20 43.7 0.04
2 2766 / 2764 / 4049 26.44 / 26.60 / 28.12 68.5 0.07
4 4253 / 4582 / 7182 29.72 / 29.43 / 32.16 117.8 0.12
8 5315 / 5365 / 13469 40.34 / 40.30 / 43.93 175.3 0.18
16 6271 / 5370 / 27046 63.19 / 64.11 / 67.64 230.5 0.23
32 8307 / 5386 / 53810 105.57 / 108.57 / 112.04 281.1 0.28

Example Usage

After installing this vLLM branch with the dependency revisions above, the following commands exercise the SM120 sparse MLA path for DeepSeek V4 Flash and GLM-5.1.

#!/usr/bin/env bash
set -euo pipefail

MODEL_TYPE="${1:-dsv4}"
PORT="${PORT:-8000}"

export FLASHINFER_LOGLEVEL="${FLASHINFER_LOGLEVEL:-0}"

case "$MODEL_TYPE" in
  dsv4)
    MODEL_PATH="${MODEL_PATH:-/path/to/DeepSeek-V4-Flash}"
    TP="${TP:-2}"

    exec vllm serve "$MODEL_PATH" \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port "$PORT" \
      --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
      --kv-cache-dtype fp8 \
      --block-size 256 \
      --tensor-parallel-size "$TP" \
      --enable-expert-parallel \
      --gpu-memory-utilization 0.95 \
      --max-model-len 65536 \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \
      --enable-flashinfer-autotune
    ;;

  glm51)
    MODEL_PATH="${MODEL_PATH:-/path/to/GLM-5.1-NVFP4}"
    TP="${TP:-8}"
    MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
    HF_OVERRIDES='{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'

    exec vllm serve "$MODEL_PATH" \
      --served-model-name GLM-5 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port "$PORT" \
      --tensor-parallel-size "$TP" \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --gpu-memory-utilization 0.85 \
      --max-model-len "$MAX_MODEL_LEN" \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 64 \
      --kv-cache-dtype fp8 \
      --moe-backend flashinfer_cutlass \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice \
      --reasoning-parser glm45 \
      --attention-backend SPARSE_MLA_SM120 \
      --max-cudagraph-capture-size 256 \
      --hf-overrides "$HF_OVERRIDES" \
      --enable-flashinfer-autotune
    ;;

  *)
    echo "usage: $0 {dsv4|glm51}" >&2
    exit 2
    ;;
esac

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for DeepSeek-V4 sparse-MLA on consumer Blackwell GPUs (SM120/SM121) by adding a new SPARSE_MLA_SM120 backend leveraging FlashInfer. Key changes include the implementation of kernel warmup logic for mHC TileLang and sparse-MLA kernels, refactoring DeepGEMM utilities to support dynamic alignment and memory-efficient scale packing for SM100/SM120, and updating CUDA backend priorities. Review feedback identifies that the warmup logic is currently missing the new backend names in its eligibility check and lacks support for V0 runners, which could result in skipped warmups and JIT-induced latency spikes.

Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated
Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated
@lucifer1004 lucifer1004 changed the title [Draft] Add DSv4 SM120 sparse MLA and DeepGEMM support May 23, 2026
@mergify mergify Bot added the deepseek Related to DeepSeek models label May 23, 2026
@lucifer1004 lucifer1004 force-pushed the dsv4-sm120-flashinfer branch from b5d8fdc to 6dc4b92 Compare May 23, 2026 12:05
@mergify

mergify Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor
@mergify mergify Bot added the documentation Improvements or additions to documentation label May 23, 2026
@lucifer1004 lucifer1004 marked this pull request as ready for review May 25, 2026 02:02
@pasta-paul

Copy link
Copy Markdown

Tracking this PR with interest from a downstream that ships DeepSeek-V4 quantization artifacts (canada-quant). We've been validating two flavors of DSv4-Flash on the same hardware (RTX PRO 6000 Server Edition / SM_120) via jasl's fork:

  • NVFP4-FP8-MTP path — clean and production-ready on jasl/vllm@ds4-sm120-preview-dev. AIME-2024 = 83.33%, IFEval prompt-strict = 0.8429, MTP draft acceptance ~79% cumulative. Verified across c=1/2/4/8 thinking-mode.
  • W4A16-FP8-MTP path — same accuracy targets, but the Marlin MoE wna16 decode path on SM_120 has a two-stage problem: (1) PR #40923 fixes the JIT-PTX corruption symptom; (2) once native sm_120a cubins land, a latent c_tmp / workspace OOB (closed PR #36889 analyzed it correctly but couldn't reproduce on A6000) manifests as cudaErrorIllegalAddress.

Two notes that may be relevant to this PR's scope:

  1. MTP-on-SM_120: Your PR routes sparse MLA through the new SPARSE_MLA_SM120 FlashInfer backend and DeepGEMM MXFP4 MoE. It doesn't appear to wire --speculative-config method=mtp. We have working MTP on the NVFP4 path via jasl with num_speculative_tokens=1 (DeepGemm next_n assertion forces k=1 on Hopper/SM12 attention paths). If you'd like a target artifact to validate MTP integration against, our HF repos are public.

  2. MoE backend coverage matrix: Your DeepGEMM MXFP4 path complements but doesn't subsume what NVFP4 (flashinfer_trtllm backend) handles for our use case, and neither covers the W4A16/Marlin wna16 path that downstream community models commonly target. If you anticipate Marlin wna16 staying gated to enterprise SKUs (H100/H200/B300) and SM_120 explicitly routed through CUTLASS/DeepGEMM, that's worth calling out in the PR description so downstreams can plan accordingly.

Happy to test against this branch when its FlashInfer/DeepGEMM dependency branches stabilize. Full repro for both artifacts:

Related: #40923 (our evidence comment), jasl/vllm#12, #43507 (DGX Spark independent SM_12x repro).

@aidendle94

aidendle94 commented May 26, 2026

Copy link
Copy Markdown

For deepseek V4 at around 230k context it will start to error out due to a split k error. I was able to have codex assist me in resolving it upstream within flash infer.

@lucifer1004

lucifer1004 commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

Hi @aidendle94 could you share the exact error or the solution if possible, thanks?

I have seen your fix at aidendle94/flashinfer@bf4fa21, will absorb that.

@mergify

mergify Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lucifer1004.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@philwinder

Copy link
Copy Markdown

Hi @lucifer1004, thank you for your work. I can confirm this also works for me on 6x NVIDIA RTX PRO 6000 Blackwell, CUDA 13.0 with driver 580.142.

I made a couple of changes to make this work. First: --tensor-parallel-size=2 --pipeline-parallel-size=3 to make it work on 6x. I also had to set an env var of MAX_JOBS=8 to stop the first-run CUTLASS NVFP4 kernel JIT from OOM-killing the host.

I think there's more to squeeze out of the gpu-memory-utilization for better concurrency.

Here's the full docker compose for reference and future people:

  glm-51:
    image: lucifer1004/dsv4-flash-sm120:20260604
    container_name: vllm-glm-51
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - /prod/models:/root/.cache/huggingface
      - /prod/glm-cache:/cache
    environment:
      - HF_TOKEN
      - HF_HUB_OFFLINE=1
      - MAX_JOBS=8
    shm_size: 1g
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3", "4", "5", "6", "7"]
              capabilities: [gpu]
    command:
      - serve
      - --model
      - nvidia/GLM-5.1-NVFP4
      - --served-model-name
      - "glm-5.1"
      - --trust-remote-code
      - --chat-template-content-format
      - "string"
      - --tensor-parallel-size
      - "2"
      - --pipeline-parallel-size
      - "3"
      - --enable-chunked-prefill
      - --enable-prefix-caching
      - --gpu-memory-utilization
      - "0.85"
      - --max-model-len
      - "131072"
      - --max-num-batched-tokens
      - "8192"
      - --max-num-seqs
      - "64"
      - --kv-cache-dtype
      - "fp8"
      - --moe-backend
      - "flashinfer_cutlass"
      - --tool-call-parser
      - "glm47"
      - --enable-auto-tool-choice
      - --reasoning-parser
      - "glm45"
      - --attention-backend
      - "FLASHINFER_MLA_SPARSE"
      - --max-cudagraph-capture-size
      - "256"
      - --hf-overrides
      - '{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'
      - --enable-flashinfer-autotune
@philwinder

Copy link
Copy Markdown

And FYI, the early GLM 5.2 nvpf4 versions do not work because:

head_size=704, FLASHINFER_MLA_SPARSE: head_size not supported

@lucifer1004 would it be possible for you to test this branch against these?

I tried:

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
@lucifer1004

Copy link
Copy Markdown
Contributor Author

And FYI, the early GLM 5.2 nvpf4 versions do not work because:

head_size=704, FLASHINFER_MLA_SPARSE: head_size not supported

@lucifer1004 would it be possible for you to test this branch against these?

I tried:

Upstream has fixed this issue, so the latest commit (with upstream merged) should run normally.

@eugr

eugr commented Jun 19, 2026

Copy link
Copy Markdown

Was it tested on DGX Spark cluster? Any performance numbers?

@lucifer1004

Copy link
Copy Markdown
Contributor Author

Was it tested on DGX Spark cluster? Any performance numbers?

Yes, and there is a docker image built for Spark: https://hub.docker.com/layers/lucifer1004/dsv4-flash-sm120/sm120-20260528-arm64-nccl2304/images/sha256-415d8957dce6ac44f8a0573fbe8f3c1f7a6b4eb7df75c7040ada9fc8db022172

Perf number is not listed here because the testing setting was a bit different from the other scenarios. Functionality is all good.

@eugr

eugr commented Jun 19, 2026

Copy link
Copy Markdown

@lucifer1004 - thanks. I see that that container is older than the most recent one - anything important that is missing?

@lucifer1004

Copy link
Copy Markdown
Contributor Author

@lucifer1004 - thanks. I see that that container is older than the most recent one - anything important that is missing?

Both x86 and arm images miss an important update in DeepGEMM. Will upload new images recently.

@eugr

eugr commented Jun 19, 2026

Copy link
Copy Markdown

EDIT: please ignore, just seen your previous message - will rebuild from the source with latest DeepGEMM updates and test.

@zyongye zyongye modified the milestone: v0.23.0 cherry picks Jun 20, 2026
@simon-mo simon-mo merged commit 44d9506 into vllm-project:main Jun 22, 2026
207 of 211 checks passed
@eugr

eugr commented Jun 22, 2026

Copy link
Copy Markdown

@simon-mo, @lucifer1004 - now that it is merged, do we still have an unresolved dependency on DeepGEMM or it's been resolved too?

@ormandj

ormandj commented Jun 24, 2026

Copy link
Copy Markdown

@lucifer1004 I see the DeepGEMM received a merge into the nv-dev branch there, does VLLM's pin need to be updated now so models that are impacted will start working again from vllm main? Thank you for all of your efforts!

@mglaubitz

Copy link
Copy Markdown

@lucifer1004 I see the DeepGEMM received a merge into the nv-dev branch there, does VLLM's pin need to be updated now so models that are impacted will start working again from vllm main? Thank you for all of your efforts!

I guess that is the DeepGemm error you were talking about:

`(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] WorkerProc hit an exception.
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] Traceback (most recent call last):
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 992, in worker_busy_loop
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] output = func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 460, in determine_available_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6531, in profile_cudagraph_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._warmup_and_capture(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6692, in _warmup_and_capture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._dummy_run(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5896, in _dummy_run
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata, _ = self._build_attention_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2513, in _build_attention_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2447, in _build_attn_group_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata_i = builder.build(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/indexer.py", line 617, in build
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self.scheduler_metadata_buffer[:] = get_paged_mqa_logits_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 562, in get_paged_mqa_logits_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return _get_paged_mqa_logits_metadata_impl(context_lens, block_size, num_sms)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] Traceback (most recent call last):
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 992, in worker_busy_loop
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] output = func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 460, in determine_available_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)

(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6692, in _warmup_and_capture 13:23:22 [106/730]
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._dummy_run(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5896, in _dummy_run
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata, _ = self._build_attention_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2513, in _build_attention_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2447, in _build_attn_group_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata_i = builder.build(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/indexer.py", line 617, in build
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self.scheduler_metadata_buffer[:] = get_paged_mqa_logits_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 562, in get_paged_mqa_logits_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return _get_paged_mqa_logits_metadata_impl(context_lens, block_size, num_sms)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000]
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] EngineCore failed to start.
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] Traceback (most recent call last):
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1200, in run_engine_core
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return func(*args, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in init
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] super().init(
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in init
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return func(*args, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return self.collective_rpc("determine_available_memory")
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in collective_rpc
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return future if non_block else future.result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return super().result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return self.__get_result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] raise self._exception
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 95, in _wait_for_response
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] response = self.aggregate(self.get_response())
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 391, in get_response
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] raise RuntimeError(
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture', please check the stack trace above for the root cause
(EngineCore pid=2150) ERROR 06-29 11:22:33 [multiproc_executor.py:284] Worker proc VllmWorker-7 died unexpectedly, shutting down executor.`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build deepseek Related to DeepSeek models documentation Improvements or additions to documentation nvidia ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1 verified Run pre-commit for new contributors without triggering other tests