Enable DeepSeek V4 and GLM-5.1 on SM120 #43477
Code Review
This pull request introduces support for DeepSeek-V4 sparse-MLA on consumer Blackwell GPUs (SM120/SM121) by adding a new SPARSE_MLA_SM120 backend leveraging FlashInfer. Key changes include the implementation of kernel warmup logic for mHC TileLang and sparse-MLA kernels, refactoring DeepGEMM utilities to support dynamic alignment and memory-efficient scale packing for SM100/SM120, and updating CUDA backend priorities. Review feedback identifies that the warmup logic is currently missing the new backend names in its eligibility check and lacks support for V0 runners, which could result in skipped warmups and JIT-induced latency spikes.
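The eligibility gap described in the review can be illustrated with a minimal sketch. Everything here is hypothetical and is not vLLM's actual warmup code: the function `should_warm_up`, the constant `WARMUP_ELIGIBLE_BACKENDS`, and the placeholder name `SOME_OLDER_BACKEND` are assumptions made for illustration. Only the backend name `SPARSE_MLA_SM120` comes from the review text above.

```python
# Hypothetical sketch: none of these names are taken from vLLM's source.
# It shows how an allow-list that was not updated for a new backend
# silently skips warmup, leaving kernel JIT compilation to the first request.

# Allow-list as it might look before the new backend is registered.
STALE_ELIGIBLE_BACKENDS = frozenset({"SOME_OLDER_BACKEND"})

# Allow-list after adding the backend named in the review.
WARMUP_ELIGIBLE_BACKENDS = STALE_ELIGIBLE_BACKENDS | {"SPARSE_MLA_SM120"}


def should_warm_up(backend_name: str,
                   eligible: frozenset = WARMUP_ELIGIBLE_BACKENDS) -> bool:
    """Return True when the selected attention backend should be warmed up."""
    return backend_name.upper() in eligible


if __name__ == "__main__":
    name = "SPARSE_MLA_SM120"
    print("stale list warms up:", should_warm_up(name, STALE_ELIGIBLE_BACKENDS))
    print("fixed list warms up:", should_warm_up(name))
```

With the stale list the check returns `False`, so warmup is skipped without any error, which matches the kind of failure the review warns about.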
Force-pushed from b5d8fdc to 6dc4b92
|
Documentation preview: https://vllm--43477.org.readthedocs.build/en/43477/ |
|
Tracking this PR with interest from a downstream that ships DeepSeek-V4 quantization artifacts (canada-quant). We've been validating two flavors of DSv4-Flash on the same hardware (RTX PRO 6000 Server Edition / SM_120) via jasl's fork:
Two notes that may be relevant to this PR's scope:
Happy to test against this branch when its FlashInfer/DeepGEMM dependency branches stabilize. Full repro for both artifacts: Related: |
|
With DeepSeek V4, at around 230k context it starts to fail with a split-K error. I was able to resolve it upstream in FlashInfer with help from Codex. |
|
I have seen your fix at aidendle94/flashinfer@bf4fa21, will absorb that. |
|
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a512579 to da05721
|
Hi @lucifer1004, thank you for your work. I can confirm this also works for me on 6x NVIDIA RTX PRO 6000 Blackwell, CUDA 13.0 with driver 580.142. I made a couple of changes to make this work. First: I think there's more to squeeze out of the gpu-memory-utilization for better concurrency. Here's the full docker compose for reference and future people:
glm-51:
  image: lucifer1004/dsv4-flash-sm120:20260604
  container_name: vllm-glm-51
  restart: unless-stopped
  ports:
    - "127.0.0.1:8000:8000"
  volumes:
    - /prod/models:/root/.cache/huggingface
    - /prod/glm-cache:/cache
  environment:
    - HF_TOKEN
    - HF_HUB_OFFLINE=1
    - MAX_JOBS=8
  shm_size: 1g
  ipc: host
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["2", "3", "4", "5", "6", "7"]
            capabilities: [gpu]
  command:
    - serve
    - --model
    - nvidia/GLM-5.1-NVFP4
    - --served-model-name
    - "glm-5.1"
    - --trust-remote-code
    - --chat-template-content-format
    - "string"
    - --tensor-parallel-size
    - "2"
    - --pipeline-parallel-size
    - "3"
    - --enable-chunked-prefill
    - --enable-prefix-caching
    - --gpu-memory-utilization
    - "0.85"
    - --max-model-len
    - "131072"
    - --max-num-batched-tokens
    - "8192"
    - --max-num-seqs
    - "64"
    - --kv-cache-dtype
    - "fp8"
    - --moe-backend
    - "flashinfer_cutlass"
    - --tool-call-parser
    - "glm47"
    - --enable-auto-tool-choice
    - --reasoning-parser
    - "glm45"
    - --attention-backend
    - "FLASHINFER_MLA_SPARSE"
    - --max-cudagraph-capture-size
    - "256"
    - --hf-overrides
    - '{"index_topk_pattern":"FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSSF"}'
    - --enable-flashinfer-autotune
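As a quick sanity check on the layout above, the sketch below confirms that the tensor-parallel and pipeline-parallel sizes multiply out to the number of GPUs exposed to the container. The helper name `required_gpus` is an assumption made for illustration and is not part of vLLM. The values are copied from the compose fragment, and the check assumes no additional data-parallel replicas.

```python
# Hypothetical helper; the values below are copied from the compose fragment.

def required_gpus(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs needed for one model replica: TP ranks per stage times PP stages."""
    if tensor_parallel < 1 or pipeline_parallel < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel * pipeline_parallel


# device_ids from the compose `deploy` section.
device_ids = ["2", "3", "4", "5", "6", "7"]

# --tensor-parallel-size 2, --pipeline-parallel-size 3
needed = required_gpus(tensor_parallel=2, pipeline_parallel=3)

if needed != len(device_ids):
    raise SystemExit(
        f"layout needs {needed} GPUs but {len(device_ids)} are exposed"
    )
print(f"layout OK: {needed} GPUs for TP=2 x PP=3")
```

Running a check like this before `docker compose up` catches a mismatch between the parallelism flags and `device_ids` early, rather than at worker startup.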
|
And FYI, the early GLM 5.2 NVFP4 versions do not work because:
@lucifer1004 would it be possible for you to test this branch against these? I tried: |
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Upstream has fixed this issue, so the latest commit (with upstream merged) should run normally. |
|
Was it tested on a DGX Spark cluster? Any performance numbers? |
Yes, and there is a Docker image built for Spark: https://hub.docker.com/layers/lucifer1004/dsv4-flash-sm120/sm120-20260528-arm64-nccl2304/images/sha256-415d8957dce6ac44f8a0573fbe8f3c1f7a6b4eb7df75c7040ada9fc8db022172 Performance numbers are not listed here because the test setup differed somewhat from the other scenarios. Functionality is all good. |
|
@lucifer1004 - thanks. I see that that container is older than the most recent one - anything important that is missing? |
Both the x86 and Arm images are missing an important DeepGEMM update. I will upload new images soon. |
|
EDIT: please ignore, just seen your previous message - will rebuild from the source with latest DeepGEMM updates and test. |
|
@simon-mo, @lucifer1004 - now that it is merged, do we still have an unresolved dependency on DeepGEMM, or has that been resolved too? |
|
@lucifer1004 I see that DeepGEMM received a merge into its nv-dev branch. Does vLLM's pin need to be updated now so that the affected models start working again from vLLM main? Thank you for all of your efforts! |
I guess that is the DeepGEMM error you were talking about:
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] WorkerProc hit an exception.
(Worker_TP0 pid=2352) ERROR 06-29 11:22:29 [multiproc_executor.py:1000] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6692, in _warmup_and_capture
|
Summary
This draft PR brings up the DeepSeek V4 and GLM-5.1 SM120 path on consumer Blackwell:
To try this version now, use the pre-built image at https://hub.docker.com/r/lucifer1004/dsv4-flash-sm120
Blocking Dependencies
This PR is draft-only until the following dependency branches are either merged upstream or replaced by equivalent released/pinned revisions:
Until those branches land upstream, this vLLM branch should be built/tested with the matching local checkouts, e.g. via DEEPGEMM_SRC_DIR for DeepGEMM and a FlashInfer install from the branch above.
Accuracy
Performance
DeepSeek-V4-Flash · TP=2 · No MTP
DeepSeek-V4-Flash · TP=2 · MTP=1
DeepSeek-V4-Flash · TP=2 · MTP=2
GLM-5.1-NVFP4 · TP=8
Example Usage
After installing this vLLM branch with the dependency revisions above, the following commands exercise the SM120 sparse MLA path for DeepSeek V4 Flash and GLM-5.1.