[Quant] Enable modelopt_mixed on Turing (SM75)#45375
Documentation preview: https://vllm--45375.org.readthedocs.build/en/45375/ |
This pull request has merge conflicts that must be resolved before it can be |
xinli-sw
left a comment
do we have tests for those?
Thanks for raising this. It's exactly the right thing to check, and I want to be upfront about what's testable here.
The good news is that this PR is narrow: it lowers a capability gate (ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75) and introduces no new kernels. The SM75 paths it unlocks already exist and already support cc ≥ 7.5: NVFP4 routed experts via Marlin W4A16, FP8 weight-only dense via MarlinFP8, and FP8 MoE via Marlin (we even build dedicated Turing Marlin/Marlin-MoE kernels today). So there's no new compute path to prove out, just the gate that was keeping the existing ones from being selected on Turing.
The issue we're facing is hardware: our Buildkite GPU fleet is A100 (SM80), L4 (SM89), H100/H200 (SM90), and B200 (SM100); there's no Turing/T4 runner. Per-arch gating is standard (we already have a has_device_capability(75) gate in test_compressed_tensors.py), but anything we gate at 75 still lands on SM80+ in CI, so it runs the SM80 Marlin path rather than real Turing. That's the part I keep coming back to: we can't test the actual operators on Turing in CI, so we have to assume they work by parallel reasoning, or rely on testing done elsewhere, e.g. a separate CI/CD flow for an operator library or a stand-alone smoke test.
So I validated it end-to-end where SM75 actually lives: a free Colab T4, where a small modelopt_mixed model loads and generates correctly: https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing. I'll paste the run output (model + sample generations + a short eval number) right into the PR so the evidence is captured here.
One nice side effect: working on SM75 surfaced a shared-memory-budget limitation in the FlashInfer prefill path on small-smem GPUs. I filed it upstream (flashinfer-ai/flashinfer#3620) and submitted a fix (flashinfer-ai/flashinfer#3621), now under review by the FlashInfer team.
The interim attention-floor bump (7.5 → 8.0) here is only so SM75 auto-selects a supported backend (TRITON_ATTN) in the meantime, with a revert pointer to Here's where I'd love your input: is there an SM75/T4 path in our CI infra that I'm not aware of? If there's a runner we can target, or a way to wire a small SM75 smoke test into the pipeline, point me at it and I'll set it |
Hi @mikekg, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, |
Head branch was pushed to by a user without write access
@LucasWilkinson fixing a doc issue stripped the auto-merge label. Can you please merge?
Lower ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75.
modelopt_mixed runs on Turing the same way it runs on Ampere: NVFP4 routed experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8 (cc>=7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores. Validated end-to-end on a Tesla T4 (SM75) serving an NVFP4 checkpoint.
On SM75, FlashInfer attention must not be auto-selected (its paged kernels fail on Turing); pair with the FlashInfer SM80 guard so attention falls back to TRITON_ATTN.
Signed-off-by: Michael Gschwind <mgschwind@nvidia.com>
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
FlashInfer supports SM75+, but is currently broken on SM75 (Turing): flashinfer-ai/flashinfer#3620 (fix: flashinfer-ai/flashinfer#3621).
Until that fix lands, raise the backend's advertised lower bound from 7.5 to 8.0 so FlashInfer is not auto-selected on SM75 and attention falls back to another supported backend (e.g. TRITON_ATTN) there. Revert to 7.5 once the FlashInfer fix is released.
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
I noticed that the merged fix follows a direction very similar to the |
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Purpose
Extend modelopt_mixed (NVFP4 routed experts + FP8 weight-only dense) inference to Turing (SM75) by lowering ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75. Marlin already supports SM75, so the kernels are in place; this gate was the only thing blocking it. The payoff is NVFP4 inference on widely available platforms, including a free Google Colab T4.
The per-layer paths already run on SM75: NVFP4 routed experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8 (cc ≥ 7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores.
This also raises the FlashInfer attention backend's minimum compute capability from 7.5 to 8.0 (it currently works on SM80+). FlashInfer ranks high in attention auto-selection, so without this bound, lowering the quantization floor to 75 would let it be auto-selected on Turing, where it is currently broken (flashinfer-ai/flashinfer#3620). With the SM80 lower bound, SM75 auto-selects a supported attention backend (e.g. TRITON_ATTN).
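As a rough illustration of the capability gate described above, the sketch below encodes compute capability as an integer (7.5 becomes 75) and compares it against a config's floor. `DeviceCapability`, `ToyMixedPrecisionConfig`, and `is_supported` are hypothetical stand-ins written for this example, not vLLM's actual classes; only the method name `get_min_capability()` and the values 80 → 75 come from this PR.

```python
from typing import NamedTuple


class DeviceCapability(NamedTuple):
    """Compute capability as (major, minor), e.g. (7, 5) for Turing."""

    major: int
    minor: int

    def to_int(self) -> int:
        # 7.5 -> 75, 8.0 -> 80, 9.0 -> 90
        return self.major * 10 + self.minor


class ToyMixedPrecisionConfig:
    """Hypothetical stand-in for a quantization config with a capability floor."""

    @classmethod
    def get_min_capability(cls) -> int:
        return 75  # this PR lowers the real gate from 80 to 75


def is_supported(config_cls, capability: DeviceCapability) -> bool:
    """A quantization method is eligible only if the device meets its floor."""
    return capability.to_int() >= config_cls.get_min_capability()


T4 = DeviceCapability(7, 5)    # Turing
A100 = DeviceCapability(8, 0)  # Ampere
V100 = DeviceCapability(7, 0)  # Volta, still below the new floor
```

With a floor of 75, `is_supported(ToyMixedPrecisionConfig, T4)` is true while the Volta device stays excluded; with the old floor of 80 the T4 would have been rejected before any kernel was considered.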
Changes
- vllm/model_executor/layers/quantization/modelopt.py: ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75.
- vllm/v1/attention/backends/flashinfer.py: supports_compute_capability() lower bound 7.5 → 8.0.
KV cache scope: this enables NVFP4 weights on Turing. The KV Cache type supported depends on the Attention function:
We can either force Triton attention on Turing for now and land this (pending a follow-up update to re-enable FlashInfer once the smem overflow is resolved), or wait until FlashInfer is fixed and then land with FlashInfer.
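The effect of the FlashInfer lower bound on auto-selection can be sketched with a toy priority list. Everything here is an assumption made for illustration: `ToyBackend`, `select_backend`, the priority order, and the 7.0 floor given to TRITON_ATTN are not taken from vLLM; only the backend names and the 7.5 → 8.0 change come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBackend:
    name: str
    min_capability: float  # lowest compute capability the backend advertises

    def supports_compute_capability(self, capability: float) -> bool:
        return capability >= self.min_capability


def select_backend(candidates, capability, forced=None):
    """Return the name of the first candidate that supports the device.

    `candidates` is ordered by priority. `forced` mimics an explicit
    override and bypasses auto-selection entirely.
    """
    if forced is not None:
        return forced
    for backend in candidates:
        if backend.supports_compute_capability(capability):
            return backend.name
    raise RuntimeError(f"no attention backend supports capability {capability}")


# Hypothetical priority order with FlashInfer first.
before = [ToyBackend("FLASHINFER", 7.5), ToyBackend("TRITON_ATTN", 7.0)]
after = [ToyBackend("FLASHINFER", 8.0), ToyBackend("TRITON_ATTN", 7.0)]
```

In this toy model, `select_backend(before, 7.5)` picks FLASHINFER, while `select_backend(after, 7.5)` falls through to TRITON_ATTN and SM80 devices are unaffected. Passing `forced="TRITON_ATTN"` corresponds to the first option above, pinning the backend instead of relying on the advertised bound.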
Test Plan
Serve an NVFP4 modelopt_mixed checkpoint on a Tesla T4 (SM75) with attention auto-selection (no --attention-backend), and confirm it loads and generates.
https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing
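A minimal load-and-generate check along these lines might look like the sketch below. It uses vLLM's offline `LLM` API rather than the server, which is a substitution made to keep the example short, and it is not the notebook's actual code. The helper names `generations_look_sane` and `run_smoke_test`, the prompt, and the sampling settings are all assumptions. Running it requires a GPU with vLLM installed.

```python
def generations_look_sane(texts):
    """True if there is at least one generation and none is blank."""
    return len(texts) > 0 and all(t.strip() for t in texts)


def run_smoke_test(model, prompts=("The capital of France is",), max_tokens=16):
    """Load `model` with vLLM and generate once; needs a GPU and vLLM."""
    # Imported lazily so this file can be imported without vLLM present.
    from vllm import LLM, SamplingParams

    # No attention backend is specified here, so vLLM auto-selects one.
    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(list(prompts), params)
    texts = [out.outputs[0].text for out in outputs]
    assert generations_look_sane(texts), "model loaded but produced empty output"
    return texts
```

On a T4 this would be called with the checkpoint name as the only argument; on a memory-constrained GPU, extra engine arguments (for example a shorter context length) may be needed.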
Test Result
nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 on a Tesla T4 (SM75, driver 610, CUDA 13): attention auto-selects TRITON_ATTN, the model loads (Marlin NVFP4 MoE, dtype auto-resolves to float16 on SM75), and generation succeeds. https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing
The SM80 path (A100) was covered by #45306.
AI assistance was used; the submitter reviewed every line and ran the tests above.