[Quant] Enable modelopt_mixed on Turing (SM75) by mikekg · Pull Request #45375 · vllm-project/vllm

mikekg · 2026-06-12T06:48:24Z

Purpose

Extend modelopt_mixed (NVFP4 routed experts + FP8 weight-only dense) inference to Turing (SM75) by lowering ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75. Marlin already supports SM75, so the kernels are in place — this gate was the only thing blocking it. The payoff is NVFP4 inference on widely available platforms, including a free Google Colab T4.

The per-layer paths already run on SM75: NVFP4 routed experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8 (cc ≥ 7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores.
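As a rough illustration of what a minimum-capability gate does, here is a minimal sketch. It does not import vLLM: `DeviceCapability`, `MixedPrecisionConfigSketch`, and `is_supported` are hypothetical stand-ins. Only the method name `get_min_capability()` and the values 80 and 75 come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCapability:
    """Compute capability as (major, minor), e.g. (7, 5) for Turing."""

    major: int
    minor: int

    def to_int(self) -> int:
        # SM75 -> 75, SM80 -> 80, SM100 -> 100
        return self.major * 10 + self.minor


class MixedPrecisionConfigSketch:
    """Hypothetical stand-in for a quantization config class."""

    @classmethod
    def get_min_capability(cls) -> int:
        # This PR lowers the real gate from 80 to 75.
        return 75


def is_supported(config_cls, device: DeviceCapability) -> bool:
    """Return True if the device meets the config's minimum capability."""
    return device.to_int() >= config_cls.get_min_capability()


t4 = DeviceCapability(7, 5)    # Turing
a100 = DeviceCapability(8, 0)  # Ampere
v100 = DeviceCapability(7, 0)  # Volta, still below the lowered gate

for name, dev in [("T4", t4), ("A100", a100), ("V100", v100)]:
    print(name, is_supported(MixedPrecisionConfigSketch, dev))
```

With the gate at 80 the T4 would be rejected before any kernel is chosen. With the gate at 75 it passes and the existing Marlin paths can be selected.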

This also sets the FlashInfer attention backend's minimum compute capability to SM80 (it is supported on SM80+). FlashInfer is high in attention auto-selection; without this, lowering the floor to 75 would let it be auto-selected on Turing. With the SM80 lower bound, SM75 auto-selects a supported attention backend (e.g. TRITON_ATTN).
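The effect of the FlashInfer floor on auto-selection can be sketched as a first-match walk over a priority list. This is a simplified model, not vLLM's selector: `BackendSketch`, `select_backend`, the two-entry priority order, and the 75 floor for TRITON_ATTN are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class BackendSketch(NamedTuple):
    """Hypothetical backend record: a name plus its lowest supported capability."""

    name: str
    min_capability: int  # e.g. 80 for SM80


def select_backend(priority: list[BackendSketch], capability: int) -> Optional[str]:
    """Return the first backend in priority order that supports the device."""
    for backend in priority:
        if capability >= backend.min_capability:
            return backend.name
    return None


# Assumed ordering: FlashInfer ahead of Triton, as the PR text implies.
before_pr = [BackendSketch("FLASHINFER", 75), BackendSketch("TRITON_ATTN", 75)]
after_pr = [BackendSketch("FLASHINFER", 80), BackendSketch("TRITON_ATTN", 75)]

print("SM75 before:", select_backend(before_pr, 75))
print("SM75 after: ", select_backend(after_pr, 75))
print("SM80 after: ", select_backend(after_pr, 80))
```

Under these assumptions, raising the FlashInfer floor changes the SM75 result only. SM80 and newer devices still pick FlashInfer first.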

Changes

vllm/model_executor/layers/quantization/modelopt.py:
ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75.
vllm/v1/attention/backends/flashinfer.py:
supports_compute_capability() lower bound 7.5 → 8.0.

KV cache scope: this enables NVFP4 weights on Turing. The supported KV cache dtype depends on the attention backend:

FlashInfer supports an fp8 KV cache even on GPUs without native FP8 support, using software emulation. However, it is currently broken on T4; the fix, "Fix prefill shared-memory budget so kernels launch on small-smem GPUs (SM75)" (flashinfer-ai/flashinfer#3621), is pending.
Triton Attention does not support an fp8 KV cache out of the box, so it would error out. "Remove redundant Triton KV cache dtype asserts and enforce architectural support (fp8 >= sm89)" (#43914) produces a clean error message.
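To make the "clean error message" idea concrete, the following is a minimal sketch of an architectural check in the spirit of the #43914 title (fp8 >= sm89). The function name, its signature, and the message text are hypothetical and are not taken from that PR.

```python
def check_triton_kv_cache_dtype(kv_cache_dtype: str, capability: int) -> None:
    """Raise a readable error when an fp8 KV cache is requested below SM89.

    `capability` is the integer form of the compute capability (75 for SM75).
    """
    if kv_cache_dtype.startswith("fp8") and capability < 89:
        raise ValueError(
            f"kv_cache_dtype={kv_cache_dtype!r} requires compute capability "
            f">= 8.9, but this device is {capability // 10}.{capability % 10}"
        )


# "auto" passes on a T4; "fp8" is rejected with an explicit message.
check_triton_kv_cache_dtype("auto", 75)
try:
    check_triton_kv_cache_dtype("fp8", 75)
except ValueError as err:
    print(err)
```

Failing early with a message like this is friendlier than an assert deep inside a kernel launch, which is the motivation the PR text gives for relying on #43914.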

We can either force Triton Attention on Turing for now and land this (with a follow-up to re-enable FlashInfer once the smem overflow is resolved), or wait until FlashInfer is fixed and then land with FlashInfer.

Test Plan

Serve an NVFP4 modelopt_mixed checkpoint on a Tesla T4 (SM75) with attention
auto-selection (no --attention-backend), and confirm it loads and generates.

https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing

import faulthandler
import sys
import time

faulthandler.enable()
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

print("STEP import vllm begin", flush=True)
from vllm import LLM, SamplingParams
print("STEP import vllm done", flush=True)

model = "/content/models/nemotron-nano-9b-v2-nvfp4"

print("STEP construct LLM begin", flush=True)
t0 = time.time()
llm = LLM(
    model=model,
    tokenizer=model,
    quantization="modelopt_mixed",
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="float16",
    max_model_len=512,
    max_num_batched_tokens=512,
    max_num_seqs=1,
    gpu_memory_utilization=0.88,
    cpu_offload_gb=0,
    attention_backend="TRITON_ATTN",
    linear_backend="marlin",
)
print("STEP construct LLM done", round(time.time() - t0, 2), flush=True)

params = SamplingParams(max_tokens=1024, temperature=0)
print("STEP generate begin", flush=True)
t0 = time.time()
out = llm.generate(["What is the Capital of Austria? Answer with exactly one word."], params)
print("STEP generate done", round(time.time() - t0, 2), flush=True)
print(out[0].outputs[0].text, flush=True)

Test Result

nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 on a Tesla T4 (SM75, driver 610, CUDA 13): attention auto-selects TRITON_ATTN, the model loads (Marlin NVFP4 MoE, dtype auto-resolves to float16 on SM75), and generation succeeds. https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing

The SM80 path (A100) was covered by #45306.

AI assistance was used; the submitter reviewed every line and ran the tests above.

mergify · 2026-06-12T06:52:12Z

Documentation preview: https://vllm--45375.org.readthedocs.build/en/45375/

mergify · 2026-06-12T21:12:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-15T17:14:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

xinli-sw

do we have tests for those?

mikekg · 2026-06-16T13:54:01Z

do we have tests for those?

Thanks for raising this — it's exactly the right thing to check, and I want to be upfront about what's testable here.

The good news is this PR is narrow: it lowers a capability gate (ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75) and introduces no new kernels. The SM75 paths it unlocks already exist and already support cc ≥ 7.5 — NVFP4 routed experts via Marlin W4A16, FP8 weight-only dense via MarlinFP8, FP8 MoE via Marlin (we even build dedicated Turing Marlin/Marlin-MoE kernels today). So there's no new compute path to prove out — just the gate that was keeping the existing ones from being selected on Turing.

The issue we're facing here is hardware: our buildkite GPU fleet is A100 (SM80), L4 (SM89), H100/H200 (SM90), and B200 (SM100); there's no Turing/T4 runner. Per-arch gating is standard (we already have a has_device_capability(75) gate in test_compressed_tensors.py), but anything we gate at 75 still lands on SM80+ in CI, so it runs the SM80 Marlin path rather than real Turing. That's the part I keep coming back to: we can't test the actual operators on Turing in CI, so we need to either assume they work by parallel reasoning or rely on testing done elsewhere, e.g., a test in a separate CI/CD flow for an operator library, or a stand-alone smoke test.
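The coverage gap described above can be sketched in a few lines. The fleet mapping mirrors the GPUs listed in this comment; the helper names `runners_passing_gate` and `runners_on_arch` are made up for illustration and are not part of any CI tooling.

```python
# Compute capability (as an integer) for each GPU type named in the comment.
CI_FLEET = {"A100": 80, "L4": 89, "H100": 90, "H200": 90, "B200": 100}


def runners_passing_gate(fleet: dict[str, int], min_capability: int) -> list[str]:
    """Runners on which a test gated at `min_capability` would execute."""
    return sorted(name for name, cc in fleet.items() if cc >= min_capability)


def runners_on_arch(fleet: dict[str, int], capability: int) -> list[str]:
    """Runners whose capability is exactly `capability`."""
    return sorted(name for name, cc in fleet.items() if cc == capability)


print("gate >= 75 runs on:", runners_passing_gate(CI_FLEET, 75))
print("actual SM75 runners:", runners_on_arch(CI_FLEET, 75))
```

A test gated at 75 runs on every runner in this fleet, yet none of them is SM75, so the Turing-specific kernels are never exercised by it.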

So I validated it end-to-end where SM75 actually lives: a free Colab T4, where a small modelopt_mixed model loads and generates correctly — https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing. I'll paste the run output (model + sample generations + a short eval number) right into the PR so the evidence is captured here.

One nice side effect: working on SM75 surfaced a shared-memory-budget limitation in the FlashInfer prefill path on small-smem GPUs. I filed it upstream (flashinfer-ai/flashinfer#3620) and submitted a fix (flashinfer-ai/flashinfer#3621), now under review by the FlashInfer team. The interim attention-floor bump (7.5 → 8.0) here is only so SM75 auto-selects a supported backend (TRITON_ATTN) in the meantime, with a revert pointer to flashinfer-ai/flashinfer#3621; once that lands, FlashInfer comes back as an SM75 option. So it's a temporary floor, not a permanent exclusion.

Here's where I'd love your input: is there an SM75/T4 path in our CI infra that I'm not aware of? If there's a runner we can target, or a way to wire a small SM75 smoke test into the pipeline, point me at it and I'll gladly set it up; I'd much rather have automated coverage than a notebook. If there isn't one available, then I think the attached T4 e2e result is the realistic bar for a gate-only change like this, and I'd propose we proceed on that basis. Either way works for me. I just want to land it the way you're comfortable with.

mergify · 2026-06-18T04:18:53Z

Hi @mikekg, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mergify · 2026-06-18T04:58:50Z

Documentation preview: https://vllm--45375.org.readthedocs.build/en/45375/

mikekg · 2026-06-18T18:59:02Z

@LucasWilkinson, fixing a doc issue stripped the auto-merge label. Can you please merge?

mergify · 2026-06-22T05:19:06Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Lower ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75. modelopt_mixed runs on Turing the same way it runs on Ampere: NVFP4 routed experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8 (cc>=7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores. Validated end-to-end on a Tesla T4 (SM75) serving an NVFP4 checkpoint. On SM75, FlashInfer attention must not be auto-selected (its paged kernels fail on Turing); pair with the FlashInfer SM80 guard so attention falls back to TRITON_ATTN.

Signed-off-by: Michael Gschwind <mgschwind@nvidia.com>
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

FlashInfer supports SM75+, but is currently broken on SM75 (Turing): flashinfer-ai/flashinfer#3620 (fix: flashinfer-ai/flashinfer#3621). Until that fix lands, raise the backend's advertised lower bound from 7.5 to 8.0 so FlashInfer is not auto-selected on SM75 and attention falls back to another supported backend (e.g. TRITON_ATTN) there. Revert to 7.5 once the FlashInfer fix is released.

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

Li-brua · 2026-06-24T07:00:24Z

I noticed that the merged fix follows a direction very similar to the proposal in my earlier PR #44403.

mikekg requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, vadiklyutiy, yewentao256 and zyongye as code owners June 12, 2026 06:48

mergify Bot added the documentation ("Improvements or additions to documentation"), nvidia, and v1 labels Jun 12, 2026

github-project-automation Bot added this to NVIDIA Jun 12, 2026

mikekg commented Jun 12, 2026

Comment thread docs/design/attention_backends.md Outdated

mergify Bot added the needs-rebase label Jun 12, 2026

mikekg force-pushed the modelopt-turing-sm75 branch from 75ed970 to 2efd609 Compare June 12, 2026 21:16

mergify Bot removed the needs-rebase label Jun 12, 2026

mikekg mentioned this pull request Jun 13, 2026

Fleetwide nvfp4 mikekg/vllm#1

Closed

mergify Bot added the needs-rebase label Jun 15, 2026

mikekg force-pushed the modelopt-turing-sm75 branch 2 times, most recently from 0ea8a27 to bfe5d81 Compare June 15, 2026 19:58

mergify Bot removed the needs-rebase label Jun 15, 2026

xinli-sw reviewed Jun 16, 2026

mikekg mentioned this pull request Jun 16, 2026

TRITON_ATTN support for KV cache dtype fp8 on sm75 to pre-sm89 #45829

Open

mikekg force-pushed the modelopt-turing-sm75 branch from e52889f to 8c141d2 Compare June 16, 2026 14:27

mikekg closed this Jun 16, 2026

mikekg deleted the modelopt-turing-sm75 branch June 16, 2026 23:56

github-project-automation Bot moved this to Done in NVIDIA Jun 16, 2026

LucasWilkinson enabled auto-merge (squash) June 18, 2026 04:18

github-actions Bot added the ready label ("ONLY add when PR is ready to merge/full CI is needed") Jun 18, 2026

auto-merge was automatically disabled June 18, 2026 04:55
Head branch was pushed to by a user without write access

mikekg force-pushed the modelopt-turing-sm75 branch from ae94859 to a6581ba Compare June 18, 2026 04:57

mikekg force-pushed the modelopt-turing-sm75 branch 2 times, most recently from 51afef3 to bf181aa Compare June 18, 2026 06:05

mergify Bot added the needs-rebase label Jun 22, 2026

mikekg added 2 commits June 22, 2026 08:55

mikekg force-pushed the modelopt-turing-sm75 branch from bf181aa to cf16db7 Compare June 22, 2026 15:57

mergify Bot removed the needs-rebase label Jun 22, 2026

pavanimajety enabled auto-merge (squash) June 22, 2026 17:52

mikekg added 5 commits June 22, 2026 13:47

Merge branch 'main' into modelopt-turing-sm75

814c134

Merge branch 'main' into modelopt-turing-sm75

e44daf0

Merge branch 'main' into modelopt-turing-sm75

1d857a0

Merge branch 'main' into modelopt-turing-sm75

70da508

Merge branch 'main' into modelopt-turing-sm75

8823c6d

mgoin approved these changes Jun 22, 2026

vllm-bot merged commit 56e5797 into vllm-project:main Jun 23, 2026
100 of 106 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 23, 2026

ir1ka mentioned this pull request Jun 23, 2026

[Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere #38776

Closed

1 task

Li-brua mentioned this pull request Jun 24, 2026

Fix FlashInfer attention auto-selection on SM75 GPUs #44403

Open

4 tasks

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Quant] Enable modelopt_mixed on Turing (SM75) (vllm-project#45375)

4ed9b87

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

mikekg mentioned this pull request Jun 25, 2026

[Bug]: [v0.22] Crash when calling API to inference to a GGUF model #44379

Open

1 task

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026

[Quant] Enable modelopt_mixed on Turing (SM75) (vllm-project#45375)

4fdecb6

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

[Quant] Enable modelopt_mixed on Turing (SM75) #45375: vllm-bot merged 7 commits into vllm-project:main from mikekg:modelopt-turing-sm75

6 participants
