[Quant] Enable modelopt_mixed on Turing (SM75)#45375

Merged
vllm-bot merged 7 commits into
vllm-project:mainfrom
mikekg:modelopt-turing-sm75
Jun 23, 2026

Conversation

@mikekg

@mikekg mikekg commented Jun 12, 2026

Contributor

Purpose

Extend modelopt_mixed (NVFP4 routed experts + FP8 weight-only dense) inference to Turing (SM75) by lowering ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75. Marlin already supports SM75, so the kernels are in place — this gate was the only thing blocking it. The payoff is NVFP4 inference on widely available platforms, including a free Google Colab T4.

The per-layer paths already run on SM75: NVFP4 routed experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8 (cc ≥ 7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores.

This also sets the FlashInfer attention backend's minimum compute capability to SM80 (it is supported on SM80+). FlashInfer is high in attention auto-selection; without this, lowering the floor to 75 would let it be auto-selected on Turing. With the SM80 lower bound, SM75 auto-selects a supported attention backend (e.g. TRITON_ATTN).
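
The effect of the raised floor on auto-selection can be illustrated with a minimal sketch. This is not vLLM's selector: the `Backend` class, the `select_backend` function, the two-entry priority list, and the TRITON_ATTN floor of (7, 0) are all assumptions made for illustration. Only the FlashInfer floor values (7.5 before, 8.0 after) come from this PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend record: a name plus its minimum (major, minor)."""
    name: str
    min_capability: Tuple[int, int]


def select_backend(
    priority: Sequence[Backend], device_capability: Tuple[int, int]
) -> Optional[str]:
    """Return the first backend, in priority order, whose floor the device meets."""
    for backend in priority:
        if device_capability >= backend.min_capability:
            return backend.name
    return None


# Assumed priority order; the TRITON_ATTN floor is illustrative only.
before = [Backend("FLASHINFER", (7, 5)), Backend("TRITON_ATTN", (7, 0))]
after = [Backend("FLASHINFER", (8, 0)), Backend("TRITON_ATTN", (7, 0))]

turing = (7, 5)  # e.g. Tesla T4
ampere = (8, 0)  # e.g. A100

print(select_backend(before, turing))  # FLASHINFER
print(select_backend(after, turing))   # TRITON_ATTN
print(select_backend(after, ampere))   # FLASHINFER
```

Under these assumptions, raising only the FlashInfer floor changes the outcome on SM75 and leaves SM80+ selection untouched.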

Changes

  • vllm/model_executor/layers/quantization/modelopt.py:
    ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75.
  • vllm/v1/attention/backends/flashinfer.py:
    supports_compute_capability() lower bound 7.5 → 8.0.
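
As a rough sketch of what the first change does, the gate can be modeled as a comparison between the device capability, encoded as major * 10 + minor, and the value returned by get_min_capability(). The class `MixedConfigSketch` and the helpers `to_int_capability` and `check_device` are hypothetical stand-ins, not vLLM code; only the method name and the 80 → 75 values come from this PR.

```python
class MixedConfigSketch:
    """Hypothetical stand-in for a quantization config; not vLLM's class."""

    def __init__(self, min_capability: int) -> None:
        self._min_capability = min_capability

    def get_min_capability(self) -> int:
        return self._min_capability


def to_int_capability(major: int, minor: int) -> int:
    """Encode (7, 5) as 75, (8, 0) as 80, and so on."""
    return major * 10 + minor


def check_device(config: MixedConfigSketch, major: int, minor: int) -> None:
    """Raise if the device is below the config's capability floor."""
    capability = to_int_capability(major, minor)
    required = config.get_min_capability()
    if capability < required:
        raise ValueError(
            f"quantization needs capability >= {required}, device has {capability}"
        )


t4 = (7, 5)
old_gate = MixedConfigSketch(80)  # floor before this PR
new_gate = MixedConfigSketch(75)  # floor after this PR

try:
    check_device(old_gate, *t4)
    old_result = "accepted"
except ValueError:
    old_result = "rejected"

check_device(new_gate, *t4)  # does not raise
new_result = "accepted"

print(old_result, new_result)  # rejected accepted
```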

KV cache scope: this enables NVFP4 weights on Turing. The supported KV cache type depends on the attention backend:

We can either force Triton attention on Turing for now and land this (with a follow-up to re-enable FlashInfer once the smem overflow is resolved), or wait until FlashInfer is fixed and then land with FlashInfer enabled.

Test Plan

Serve an NVFP4 modelopt_mixed checkpoint on a Tesla T4 (SM75) with attention
auto-selection (no --attention-backend), and confirm it loads and generates.

https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing

import faulthandler
import sys
import time

faulthandler.enable()
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

print("STEP import vllm begin", flush=True)
from vllm import LLM, SamplingParams
print("STEP import vllm done", flush=True)

model = "/content/models/nemotron-nano-9b-v2-nvfp4"

print("STEP construct LLM begin", flush=True)
t0 = time.time()
llm = LLM(
    model=model,
    tokenizer=model,
    quantization="modelopt_mixed",
    tensor_parallel_size=1,
    trust_remote_code=True,
    dtype="float16",
    max_model_len=512,
    max_num_batched_tokens=512,
    max_num_seqs=1,
    gpu_memory_utilization=0.88,
    cpu_offload_gb=0,
    attention_backend="TRITON_ATTN",
    linear_backend="marlin",
)
print("STEP construct LLM done", round(time.time() - t0, 2), flush=True)

params = SamplingParams(max_tokens=1024, temperature=0)
print("STEP generate begin", flush=True)
t0 = time.time()
out = llm.generate(["What is the Capital of Austria? Answer with exactly one word."], params)
print("STEP generate done", round(time.time() - t0, 2), flush=True)
print(out[0].outputs[0].text, flush=True)

Test Result

nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 on a Tesla T4 (SM75, driver 610, CUDA 13): attention auto-selects TRITON_ATTN, the model loads (Marlin NVFP4 MoE, dtype auto-resolves to float16 on SM75), and generation succeeds. https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing

The SM80 path (A100) was covered by #45306.


AI assistance was used; the submitter reviewed every line and ran the tests above.

@mergify

mergify Bot commented Jun 12, 2026

Contributor
@mergify mergify Bot added documentation Improvements or additions to documentation nvidia v1 labels Jun 12, 2026
Comment thread docs/design/attention_backends.md Outdated
@mergify

mergify Bot commented Jun 12, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 12, 2026
@mikekg mikekg force-pushed the modelopt-turing-sm75 branch from 75ed970 to 2efd609 Compare June 12, 2026 21:16
@mergify mergify Bot removed the needs-rebase label Jun 12, 2026
@mikekg mikekg mentioned this pull request Jun 13, 2026
@mergify

mergify Bot commented Jun 15, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 15, 2026
@mikekg mikekg force-pushed the modelopt-turing-sm75 branch 2 times, most recently from 0ea8a27 to bfe5d81 Compare June 15, 2026 19:58
@mergify mergify Bot removed the needs-rebase label Jun 15, 2026

@xinli-sw xinli-sw left a comment

Contributor

do we have tests for those?

@mikekg

mikekg commented Jun 16, 2026

Contributor Author

> do we have tests for those?

Thanks for raising this — it's exactly the right thing to check, and I want to be upfront about what's testable here.

The good news is this PR is narrow: it lowers a capability gate (ModelOptMixedPrecisionConfig.get_min_capability() 80 → 75) and introduces no new kernels. The SM75 paths it unlocks already exist and already support cc ≥ 7.5 — NVFP4 routed experts via Marlin W4A16, FP8 weight-only dense via MarlinFP8, FP8 MoE via Marlin (we even build dedicated Turing Marlin/Marlin-MoE kernels today). So there's no new compute path to prove out — just the gate that was keeping the existing ones from being selected on Turing.

The issue we're facing here is hardware: our Buildkite GPU fleet is A100 (SM80), L4 (SM89), H100/H200 (SM90), and B200 (SM100); there is no Turing/T4 runner. Per-arch gating is standard (we already have a has_device_capability(75) gate in test_compressed_tensors.py), but anything we gate at 75 still lands on SM80+ in CI, so it runs the SM80 Marlin path rather than real Turing. That's the part I keep coming back to: we can't test the actual operators on Turing in CI, so we have to assume they work by parallel reasoning, or rely on testing done elsewhere, e.g., in a separate CI/CD flow for an operator library, or as a standalone smoke test.
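
To make the coverage gap concrete, here is a small sketch under stated assumptions. The fleet capabilities are the ones listed above; `passes_gate` and `exercises_turing` are hypothetical helpers written for this illustration and are not the has_device_capability() implementation.

```python
# Capabilities of the CI fleet described above, encoded as major * 10 + minor.
CI_FLEET = {"A100": 80, "L4": 89, "H100/H200": 90, "B200": 100}
T4 = 75


def passes_gate(capability: int, required: int = 75) -> bool:
    """A '>= required' gate, the usual shape of a per-arch test guard."""
    return capability >= required


def exercises_turing(capability: int) -> bool:
    """True only when the test actually runs on an SM75 device."""
    return capability == 75


for name, capability in CI_FLEET.items():
    print(name, passes_gate(capability), exercises_turing(capability))

# Every CI runner passes the gate, yet none runs the Turing code path.
gated_runners = [n for n, c in CI_FLEET.items() if passes_gate(c)]
turing_runners = [n for n, c in CI_FLEET.items() if exercises_turing(c)]
print(len(gated_runners), len(turing_runners))  # 4 0
print(passes_gate(T4), exercises_turing(T4))    # True True
```

A test gated at 75 therefore runs everywhere in CI but only ever on SM80+ hardware, which is why the T4 notebook run is the evidence offered for the SM75 path.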

So I validated it end-to-end where SM75 actually lives: a free Colab T4, where a small modelopt_mixed model loads and generates correctly — https://colab.research.google.com/drive/1G9kTedkUUvCTj2g-7z-aD9p2emlzZyey?usp=sharing. I'll paste the run output (model + sample generations + a short eval number) right into the PR so the evidence is captured here.

One nice side effect: working on SM75 surfaced a shared-memory-budget limitation in the FlashInfer prefill path on small-smem GPUs. I filed it upstream (flashinfer-ai/flashinfer#3620) and submitted a fix (flashinfer-ai/flashinfer#3621), now under review by the FlashInfer team. The interim attention-floor bump (7.5 → 8.0) here is only so SM75 auto-selects a supported backend (TRITON_ATTN) in the meantime, with a revert pointer to #3621; once that lands, FlashInfer comes back as an SM75 option. So it's a temporary floor, not a permanent exclusion.

Here's where I'd love your input: is there an SM75/T4 path in our CI infra that I'm not aware of? If there's a runner we can target, or a way to wire a small SM75 smoke test into the pipeline, point me at it and I'll gladly set it up; I'd much rather have automated coverage than a notebook. If there isn't one available, then I think the attached T4 e2e result is the realistic bar for a gate-only change like this, and I'd propose we proceed on that basis. Either way works for me; I just want to land it the way you're comfortable with.

@mikekg mikekg force-pushed the modelopt-turing-sm75 branch from e52889f to 8c141d2 Compare June 16, 2026 14:27
@mikekg mikekg closed this Jun 16, 2026
@mikekg mikekg deleted the modelopt-turing-sm75 branch June 16, 2026 23:56
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Jun 16, 2026
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) June 18, 2026 04:18
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026
@mergify

mergify Bot commented Jun 18, 2026

Contributor

Hi @mikekg, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

auto-merge was automatically disabled June 18, 2026 04:55

Head branch was pushed to by a user without write access

@mikekg mikekg force-pushed the modelopt-turing-sm75 branch from ae94859 to a6581ba Compare June 18, 2026 04:57
@mergify

mergify Bot commented Jun 18, 2026

Contributor
@mikekg mikekg force-pushed the modelopt-turing-sm75 branch 2 times, most recently from 51afef3 to bf181aa Compare June 18, 2026 06:05
@mikekg

mikekg commented Jun 18, 2026

Contributor Author

@LucasWilkinson fixing a doc issue stripped the auto-merge label. Can you please merge?

@mergify

mergify Bot commented Jun 22, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikekg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 22, 2026
mikekg added 2 commits June 22, 2026 08:55
Lower ModelOptMixedPrecisionConfig.get_min_capability() from 80 to 75.
modelopt_mixed runs on Turing the same way it runs on Ampere: NVFP4 routed
experts via Marlin W4A16 (SM75+), FP8 weight-only dense via MarlinFP8
(cc>=7.5), and FP8 MoE via Marlin. None require native FP8 tensor cores.
Validated end-to-end on a Tesla T4 (SM75) serving an NVFP4 checkpoint.

On SM75, FlashInfer attention must not be auto-selected (its paged kernels
fail on Turing); pair with the FlashInfer SM80 guard so attention falls back
to TRITON_ATTN.

Signed-off-by: Michael Gschwind <mgschwind@nvidia.com>

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
FlashInfer supports SM75+, but is currently broken on SM75 (Turing):
flashinfer-ai/flashinfer#3620 (fix: flashinfer-ai/flashinfer#3621). Until
that fix lands, raise the backend's advertised lower bound from 7.5 to 8.0
so FlashInfer is not auto-selected on SM75 and attention falls back to
another supported backend (e.g. TRITON_ATTN) there. Revert to 7.5 once the
FlashInfer fix is released.

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
@mikekg mikekg force-pushed the modelopt-turing-sm75 branch from bf181aa to cf16db7 Compare June 22, 2026 15:57
@mergify mergify Bot removed the needs-rebase label Jun 22, 2026
@pavanimajety pavanimajety enabled auto-merge (squash) June 22, 2026 17:52
@vllm-bot vllm-bot merged commit 56e5797 into vllm-project:main Jun 23, 2026
100 of 106 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 23, 2026
@Li-brua

Li-brua commented Jun 24, 2026


I noticed that the merged fix follows a direction very similar to the proposal in my earlier PR #44403.

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

Labels

documentation Improvements or additions to documentation nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

6 participants