Enable DeepSeek V4 and GLM-5.1 on SM120 by lucifer1004 · Pull Request #43477 · vllm-project/vllm

lucifer1004 · 2026-05-23T10:56:42Z

Summary

This draft PR brings up the DeepSeek V4 and GLM-5.1 SM120 path on consumer Blackwell:

routes DSv4 sparse MLA through the FlashInfer SM120 sparse MLA wrapper/backend
adds SM120 sparse MLA decode/prefill handling for sparse MLA models
wires DSv4-specific kernel warmups, including FlashInfer decode autotune during warmup
enables DeepGEMM MXFP4 MoE paths and related grouped-GEMM heuristics for SM120
updates CMake plumbing for local/fetched DeepGEMM and QuTLASS source trees

To use the version now, you can use the pre-built image at https://hub.docker.com/r/lucifer1004/dsv4-flash-sm120

Blocking Dependencies

This PR is draft-only until the following dependency branches are either merged upstream or replaced by equivalent released/pinned revisions:

FlashInfer SM120 sparse MLA (feat(attention): add SM120 sparse MLA kernels flashinfer-ai/flashinfer#3395)
DeepGEMM SM120 / MXFP4 (feat: add sm120 support for DeepGEMM deepseek-ai/DeepGEMM#324)

Until those branches land upstream, this vLLM branch should be built/tested with the matching local checkouts, e.g. via DEEPGEMM_SRC_DIR for DeepGEMM and a FlashInfer install from the branch above.

Accuracy

NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, SM120)
GSM8K Pass@1: 95.0%
GPQA-Diamond Pass@1
- No thinking: 73.2%
- Max thinking: 87.4% (384K context length, 2 problems failure due to output length cap)

Performance

NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, SM120)
Random dataset
ISL=8000, OSL=1000

DeepSeek-V4-Flash · TP=2 · No MTP

Concurrency	TTFT mean / median / P99 (ms)	TPOT mean / median / P99 (ms)	Output throughput (tok/s)	Request throughput (req/s)
1	718 / 717 / 725	10.65 / 10.65 / 10.65	88.1	0.09
2	1066 / 1076 / 1433	12.93 / 12.92 / 13.29	143.0	0.14
4	1668 / 1769 / 2806	17.26 / 17.12 / 18.32	211.5	0.21
8	2660 / 2973 / 5298	27.82 / 27.72 / 30.63	262.6	0.26
16	3098 / 2327 / 10723	51.42 / 52.50 / 54.51	293.5	0.29
32	4609 / 2352 / 21994	75.60 / 77.74 / 80.36	398.6	0.40

DeepSeek-V4-Flash · TP=2 · MTP=1

Concurrency	TTFT mean / median / P99 (ms)	TPOT mean / median / P99 (ms)	Output throughput (tok/s)	Request throughput (req/s)	Acceptance rate / length
1	738 / 739 / 741	6.82 / 6.82 / 6.89	132.4	0.13	96.75% / 1.97
2	1102 / 773 / 2147	9.30 / 9.25 / 10.08	192.3	0.19	98.86% / 1.99
4	1303 / 1121 / 2800	13.61 / 13.72 / 15.74	267.6	0.27	98.35% / 1.98
8	1841 / 1545 / 5493	25.63 / 26.01 / 30.41	287.6	0.29	97.14% / 1.97
16	2670 / 1652 / 11028	38.24 / 38.27 / 49.34	388.4	0.39	97.24% / 1.97
32	4825 / 2981 / 22246	50.99 / 51.43 / 69.78	566.3	0.57	97.45% / 1.97

DeepSeek-V4-Flash · TP=2 · MTP=2

Concurrency	TTFT mean / median / P99 (ms)	TPOT mean / median / P99 (ms)	Output throughput (tok/s)	Request throughput (req/s)	Acceptance rate / length
1	734 / 734 / 736	5.58 / 5.47 / 6.06	158.5	0.16	91.48% / 2.83
2	946 / 756 / 1806	8.58 / 7.91 / 12.04	205.5	0.21	84.26% / 2.69
4	1051 / 787 / 2752	18.96 / 18.48 / 23.37	197.4	0.20	81.07% / 2.62
8	1499 / 897 / 5465	25.24 / 24.03 / 34.63	289.7	0.29	80.68% / 2.61
16	2413 / 924 / 11050	32.34 / 31.32 / 49.13	439.4	0.44	83.91% / 2.68
32	3968 / 1662 / 22397	44.76 / 44.39 / 73.33	633.3	0.63	85.59% / 2.71

GLM-5.1-NVFP4 · TP=8

Concurrency	TTFT mean / median / P99 (ms)	TPOT mean / median / P99 (ms)	Output throughput (tok/s)	Request throughput (req/s)
1	1824 / 1822 / 1838	21.07 / 21.06 / 21.20	43.7	0.04
2	2766 / 2764 / 4049	26.44 / 26.60 / 28.12	68.5	0.07
4	4253 / 4582 / 7182	29.72 / 29.43 / 32.16	117.8	0.12
8	5315 / 5365 / 13469	40.34 / 40.30 / 43.93	175.3	0.18
16	6271 / 5370 / 27046	63.19 / 64.11 / 67.64	230.5	0.23
32	8307 / 5386 / 53810	105.57 / 108.57 / 112.04	281.1	0.28

Example Usage

After installing this vLLM branch with the dependency revisions above, the following commands exercise the SM120 sparse MLA path for DeepSeek V4 Flash and GLM-5.1.

#!/usr/bin/env bash
set -euo pipefail

MODEL_TYPE="${1:-dsv4}"
PORT="${PORT:-8000}"

export FLASHINFER_LOGLEVEL="${FLASHINFER_LOGLEVEL:-0}"

case "$MODEL_TYPE" in
  dsv4)
    MODEL_PATH="${MODEL_PATH:-/path/to/DeepSeek-V4-Flash}"
    TP="${TP:-2}"

    exec vllm serve "$MODEL_PATH" \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port "$PORT" \
      --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
      --kv-cache-dtype fp8 \
      --block-size 256 \
      --tensor-parallel-size "$TP" \
      --enable-expert-parallel \
      --gpu-memory-utilization 0.95 \
      --max-model-len 65536 \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \
      --enable-flashinfer-autotune
    ;;

  glm51)
    MODEL_PATH="${MODEL_PATH:-/path/to/GLM-5.1-NVFP4}"
    TP="${TP:-8}"
    MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
    HF_OVERRIDES='{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'

    exec vllm serve "$MODEL_PATH" \
      --served-model-name GLM-5 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port "$PORT" \
      --tensor-parallel-size "$TP" \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --gpu-memory-utilization 0.85 \
      --max-model-len "$MAX_MODEL_LEN" \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 64 \
      --kv-cache-dtype fp8 \
      --moe-backend flashinfer_cutlass \
      --tool-call-parser glm47 \
      --enable-auto-tool-choice \
      --reasoning-parser glm45 \
      --attention-backend SPARSE_MLA_SM120 \
      --max-cudagraph-capture-size 256 \
      --hf-overrides "$HF_OVERRIDES" \
      --enable-flashinfer-autotune
    ;;

  *)
    echo "usage: $0 {dsv4|glm51}" >&2
    exit 2
    ;;
esac

gemini-code-assist

Code Review

This pull request introduces support for DeepSeek-V4 sparse-MLA on consumer Blackwell GPUs (SM120/SM121) by adding a new SPARSE_MLA_SM120 backend leveraging FlashInfer. Key changes include the implementation of kernel warmup logic for mHC TileLang and sparse-MLA kernels, refactoring DeepGEMM utilities to support dynamic alignment and memory-efficient scale packing for SM100/SM120, and updating CUDA backend priorities. Review feedback identifies that the warmup logic is currently missing the new backend names in its eligibility check and lacks support for V0 runners, which could result in skipped warmups and JIT-induced latency spikes.

mergify · 2026-05-23T23:16:19Z

Documentation preview: https://vllm--43477.org.readthedocs.build/en/43477/

pasta-paul · 2026-05-25T03:11:43Z

Tracking this PR with interest from a downstream that ships DeepSeek-V4 quantization artifacts (canada-quant). We've been validating two flavors of DSv4-Flash on the same hardware (RTX PRO 6000 Server Edition / SM_120) via jasl's fork:

NVFP4-FP8-MTP path — clean and production-ready on jasl/vllm@ds4-sm120-preview-dev. AIME-2024 = 83.33%, IFEval prompt-strict = 0.8429, MTP draft acceptance ~79% cumulative. Verified across c=1/2/4/8 thinking-mode.
W4A16-FP8-MTP path — same accuracy targets, but the Marlin MoE wna16 decode path on SM_120 has a two-stage problem: (1) PR #40923 fixes the JIT-PTX corruption symptom; (2) once native sm_120a cubins land, a latent c_tmp / workspace OOB (closed PR #36889 analyzed it correctly but couldn't reproduce on A6000) manifests as cudaErrorIllegalAddress.

Two notes that may be relevant to this PR's scope:

MTP-on-SM_120: Your PR routes sparse MLA through the new SPARSE_MLA_SM120 FlashInfer backend and DeepGEMM MXFP4 MoE. It doesn't appear to wire --speculative-config method=mtp. We have working MTP on the NVFP4 path via jasl with num_speculative_tokens=1 (DeepGemm next_n assertion forces k=1 on Hopper/SM12 attention paths). If you'd like a target artifact to validate MTP integration against, our HF repos are public.
MoE backend coverage matrix: Your DeepGEMM MXFP4 path complements but doesn't subsume what NVFP4 (flashinfer_trtllm backend) handles for our use case, and neither covers the W4A16/Marlin wna16 path that downstream community models commonly target. If you anticipate Marlin wna16 staying gated to enterprise SKUs (H100/H200/B300) and SM_120 explicitly routed through CUTLASS/DeepGEMM, that's worth calling out in the PR description so downstreams can plan accordingly.

Happy to test against this branch when its FlashInfer/DeepGEMM dependency branches stabilize. Full repro for both artifacts:

Related: #40923 (our evidence comment), jasl/vllm#12, #43507 (DGX Spark independent SM_12x repro).

aidendle94 · 2026-05-26T20:47:05Z

For deepseek V4 at around 230k context it will start to error out due to a split k error. I was able to have codex assist me in resolving it upstream within flash infer.

lucifer1004 · 2026-05-26T22:44:19Z

~~Hi @aidendle94 could you share the exact error or the solution if possible, thanks?~~

I have seen your fix at aidendle94/flashinfer@bf4fa21, will absorb that.

mergify · 2026-05-27T00:20:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lucifer1004.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

philwinder · 2026-06-18T08:32:08Z

Hi @lucifer1004, thank you for your work. I can confirm this also works for me on 6x NVIDIA RTX PRO 6000 Blackwell, CUDA 13.0 with driver 580.142.

I made a couple of changes to make this work. First: --tensor-parallel-size=2 --pipeline-parallel-size=3 to make it work on 6x. I also had to set an env var of MAX_JOBS=8 to stop the first-run CUTLASS NVFP4 kernel JIT from OOM-killing the host.

I think there's more to squeeze out of the gpu-memory-utilization for better concurrency.

Here's the full docker compose for reference and future people:

  glm-51:
    image: lucifer1004/dsv4-flash-sm120:20260604
    container_name: vllm-glm-51
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - /prod/models:/root/.cache/huggingface
      - /prod/glm-cache:/cache
    environment:
      - HF_TOKEN
      - HF_HUB_OFFLINE=1
      - MAX_JOBS=8
    shm_size: 1g
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3", "4", "5", "6", "7"]
              capabilities: [gpu]
    command:
      - serve
      - --model
      - nvidia/GLM-5.1-NVFP4
      - --served-model-name
      - "glm-5.1"
      - --trust-remote-code
      - --chat-template-content-format
      - "string"
      - --tensor-parallel-size
      - "2"
      - --pipeline-parallel-size
      - "3"
      - --enable-chunked-prefill
      - --enable-prefix-caching
      - --gpu-memory-utilization
      - "0.85"
      - --max-model-len
      - "131072"
      - --max-num-batched-tokens
      - "8192"
      - --max-num-seqs
      - "64"
      - --kv-cache-dtype
      - "fp8"
      - --moe-backend
      - "flashinfer_cutlass"
      - --tool-call-parser
      - "glm47"
      - --enable-auto-tool-choice
      - --reasoning-parser
      - "glm45"
      - --attention-backend
      - "FLASHINFER_MLA_SPARSE"
      - --max-cudagraph-capture-size
      - "256"
      - --hf-overrides
      - '{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'
      - --enable-flashinfer-autotune

philwinder · 2026-06-18T09:44:45Z

And FYI, the early GLM 5.2 nvpf4 versions do not work because:

head_size=704, FLASHINFER_MLA_SPARSE: head_size not supported

@lucifer1004 would it be possible for you to test this branch against these?

I tried:

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

lucifer1004 · 2026-06-18T16:45:15Z

And FYI, the early GLM 5.2 nvpf4 versions do not work because:

head_size=704, FLASHINFER_MLA_SPARSE: head_size not supported

@lucifer1004 would it be possible for you to test this branch against these?

I tried:

https://huggingface.co/Mapika/GLM-5.2-NVFP4

https://huggingface.co/Lorbus/GLM-5.2-NVFP4

Upstream has fixed this issue, so the latest commit (with upstream merged) should run normally.

eugr · 2026-06-19T05:26:53Z

Was it tested on DGX Spark cluster? Any performance numbers?

lucifer1004 · 2026-06-19T06:45:40Z

Was it tested on DGX Spark cluster? Any performance numbers?

Yes, and there is a docker image built for Spark: https://hub.docker.com/layers/lucifer1004/dsv4-flash-sm120/sm120-20260528-arm64-nccl2304/images/sha256-415d8957dce6ac44f8a0573fbe8f3c1f7a6b4eb7df75c7040ada9fc8db022172

Perf number is not listed here because the testing setting was a bit different from the other scenarios. Functionality is all good.

eugr · 2026-06-19T06:57:34Z

@lucifer1004 - thanks. I see that that container is older than the most recent one - anything important that is missing?

lucifer1004 · 2026-06-19T09:09:03Z

@lucifer1004 - thanks. I see that that container is older than the most recent one - anything important that is missing?

Both x86 and arm images miss an important update in DeepGEMM. Will upload new images recently.

eugr · 2026-06-19T15:17:44Z

EDIT: please ignore, just seen your previous message - will rebuild from the source with latest DeepGEMM updates and test.

eugr · 2026-06-22T18:56:47Z

@simon-mo, @lucifer1004 - now that it is merged, do we still have an unresolved dependency on DeepGEMM or it's been resolved too?

ormandj · 2026-06-24T12:22:47Z

@lucifer1004 I see the DeepGEMM received a merge into the nv-dev branch there, does VLLM's pin need to be updated now so models that are impacted will start working again from vllm main? Thank you for all of your efforts!

mglaubitz · 2026-06-29T11:25:50Z

@lucifer1004 I see the DeepGEMM received a merge into the nv-dev branch there, does VLLM's pin need to be updated now so models that are impacted will start working again from vllm main? Thank you for all of your efforts!

I guess that is the DeepGemm error you were talking about:

`(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] WorkerProc hit an exception.
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] Traceback (most recent call last):
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 992, in worker_busy_loop
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] output = func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 460, in determine_available_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6531, in profile_cudagraph_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._warmup_and_capture(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6692, in _warmup_and_capture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._dummy_run(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5896, in _dummy_run
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata, _ = self._build_attention_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2513, in _build_attention_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2447, in _build_attn_group_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata_i = builder.build(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/indexer.py", line 617, in build
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self.scheduler_metadata_buffer[:] = get_paged_mqa_logits_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 562, in get_paged_mqa_logits_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return _get_paged_mqa_logits_metadata_impl(context_lens, block_size, num_sms)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] Traceback (most recent call last):
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 992, in worker_busy_loop
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] output = func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 460, in determine_available_memory
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)

(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6692, in _warmup_and_capture 13:23:22 [106/730]
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self._dummy_run(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return func(*args, **kwargs)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5896, in _dummy_run
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata, _ = self._build_attention_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2513, in _build_attention_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2447, in _build_attn_group_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] attn_metadata_i = builder.build(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/indexer.py", line 617, in build
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] self.scheduler_metadata_buffer[:] = get_paged_mqa_logits_metadata(
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 562, in get_paged_mqa_logits_metadata
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] return _get_paged_mqa_logits_metadata_impl(context_lens, block_size, num_sms)
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] RuntimeError: Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000]
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] EngineCore failed to start.
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] Traceback (most recent call last):
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1200, in run_engine_core
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return func(*args, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in init
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] super().init(
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in init
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return func(*args, **kwargs)
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return self.collective_rpc("determine_available_memory")
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in collective_rpc
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return future if non_block else future.result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return super().result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] return self.__get_result()
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] raise self._exception
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 95, in _wait_for_response
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] response = self.aggregate(self.get_response())
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 391, in get_response
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] raise RuntimeError(
(EngineCore pid=2150) ERROR 06-29 11:22:29 [core.py:1231] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/attention.hpp:219): Unsupported architecture', please check the stack trace above for the root cause
(EngineCore pid=2150) ERROR 06-29 11:22:33 [multiproc_executor.py:284] Worker proc VllmWorker-7 died unexpectedly, shutting down executor.`

mergify Bot added ci/build nvidia v1 labels May 23, 2026

github-project-automation Bot added this to NVIDIA May 23, 2026

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated

Comment thread vllm/model_executor/warmup/kernel_warmup.py Outdated

lucifer1004 changed the title ~~[Draft] Add DSv4 SM120 sparse MLA and DeepGEMM support~~ May 23, 2026

mergify Bot added the deepseek Related to DeepSeek models label May 23, 2026

lucifer1004 force-pushed the dsv4-sm120-flashinfer branch from b5d8fdc to 6dc4b92 Compare May 23, 2026 12:05

mergify Bot added the documentation Improvements or additions to documentation label May 23, 2026

lucifer1004 marked this pull request as ready for review May 25, 2026 02:02

lucifer1004 requested review from Harry-Chen, LucasWilkinson, MatthewBonanni, mgoin, njhill, pavanimajety, tlrmchlsmth and zyongye as code owners May 25, 2026 02:02

zyongye self-assigned this May 26, 2026

pasta-paul mentioned this pull request May 26, 2026

[Bug]: Deepseek V4 failed to load on RTX PRO 6000 #40821

Open

mergify Bot added the needs-rebase label May 27, 2026

mratsim mentioned this pull request May 27, 2026

[Killer app] Implement DeepSeek v4 Flash for SM120 mratsim/tattletale#28

Open

mergify Bot removed the needs-rebase label May 27, 2026

lucifer1004 force-pushed the dsv4-sm120-flashinfer branch from a512579 to da05721 Compare May 27, 2026 13:53

tgmerritt mentioned this pull request May 27, 2026

[Bugfix][SM120] Enable CUTLASS grouped GEMM (MoE) for SM_120/SM_121 consumer Blackwell #43814

Open

Merge remote-tracking branch 'upstream/main' into dsv4-sm120-flashinfer

40c6120

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

Merge branch 'main' into dsv4-sm120-flashinfer

f519765

Merge branch 'main' into dsv4-sm120-flashinfer

0ef38f0

Merge branch 'main' into dsv4-sm120-flashinfer

ab6e27e

zyongye modified the milestone: v0.23.0 cherry picks Jun 20, 2026

Merge branch 'main' into dsv4-sm120-flashinfer

7d8766f

simon-mo merged commit 44d9506 into vllm-project:main Jun 22, 2026
207 of 211 checks passed

ormandj mentioned this pull request Jun 24, 2026

feat: add sm120 support for DeepGEMM deepseek-ai/DeepGEMM#324

Merged

This was referenced Jun 25, 2026

DeepSeek V4 Flash on GB10: stable SM12x sparse-MLA path pending timothystewart6/vllm-gb10#26

Open

DeepGEMM/MXFP4 on GB10: SM12x scale-layout/JIT path fails for DSv4 Flash timothystewart6/vllm-gb10#27

Open

ehfd mentioned this pull request Jun 29, 2026

[Feature] TRITON_MLA_SPARSE backend for SM8x/11x/12x DSA Sparse MLA Support #38476

Open

pxljs mentioned this pull request Jun 29, 2026

[Bug]: sparse-MLA indexer error with GLM 5.2 NVFP4 on RTX 6000 Pro SM120 #46726

Open

1 task

yilin-void mentioned this pull request Jun 30, 2026

feat: SM120 (Blackwell Desktop) support for GLM-5.1 inference sgl-project/sglang#26928

Open

volkanncicek mentioned this pull request Jun 30, 2026

Version bump request: vLLM v0.23.1rc0 -> v0.24.0 timothystewart6/vllm-gb10#34

Closed

timothystewart6 mentioned this pull request Jun 30, 2026

chore(deps): bump vLLM v0.24.0, uv 0.11.26, apt snapshot 2026-06-30 timothystewart6/vllm-gb10#35

Merged

This was referenced Jul 1, 2026

[Bug] v0.24.0: DeepGEMM "Unknown recipe" assertion in FP8 kernel warmup on Blackwell (sm_120) — regression vs 0.23.0 #47130

Open

Update DeepGEMM tag to point to latest nv-dev branch for sm120 support #47304

Merged

waynehacking8 mentioned this pull request Jul 2, 2026

[Bug]: Block-scaled FP8 (compressed-tensors W8A8) crashes on load on SM120 Blackwell (RTX PRO 6000), v0.24.0 — DeepGEMM "Unknown SF transformation" assertion #47436

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Enable DeepSeek V4 and GLM-5.1 on SM120#43477

Enable DeepSeek V4 and GLM-5.1 on SM120#43477
simon-mo merged 28 commits into
vllm-project:mainfrom
lucifer1004:dsv4-sm120-flashinfer

lucifer1004 commented May 23, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

mergify Bot commented May 23, 2026

pasta-paul commented May 25, 2026

aidendle94 commented May 26, 2026 •

edited

Loading

lucifer1004 commented May 26, 2026 •

edited

Loading

mergify Bot commented May 27, 2026

philwinder commented Jun 18, 2026

philwinder commented Jun 18, 2026

lucifer1004 commented Jun 18, 2026

eugr commented Jun 19, 2026

lucifer1004 commented Jun 19, 2026

eugr commented Jun 19, 2026

lucifer1004 commented Jun 19, 2026

eugr commented Jun 19, 2026 •

edited

Loading

Uh oh!

eugr commented Jun 22, 2026

ormandj commented Jun 24, 2026

mglaubitz commented Jun 29, 2026

Labels

13 participants

Uh oh!

Uh oh!

Conversation

lucifer1004 commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Blocking Dependencies

Accuracy

Performance

DeepSeek-V4-Flash · TP=2 · No MTP

DeepSeek-V4-Flash · TP=2 · MTP=1

DeepSeek-V4-Flash · TP=2 · MTP=2

GLM-5.1-NVFP4 · TP=8

Example Usage

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

mergify Bot commented May 23, 2026

pasta-paul commented May 25, 2026

aidendle94 commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

lucifer1004 commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

mergify Bot commented May 27, 2026

philwinder commented Jun 18, 2026

philwinder commented Jun 18, 2026

lucifer1004 commented Jun 18, 2026

eugr commented Jun 19, 2026

lucifer1004 commented Jun 19, 2026

eugr commented Jun 19, 2026

lucifer1004 commented Jun 19, 2026

eugr commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eugr commented Jun 22, 2026

ormandj commented Jun 24, 2026

mglaubitz commented Jun 29, 2026

Labels

13 participants

lucifer1004 commented May 23, 2026 •

edited

Loading

aidendle94 commented May 26, 2026 •

edited

Loading

lucifer1004 commented May 26, 2026 •

edited

Loading

eugr commented Jun 19, 2026 •

edited

Loading