[Bugfix][Quantization] Don't reject fp8_e5m2 KV cache for non-fp8 quantized checkpoints by Sunt-ing · Pull Request #45040 · vllm-project/vllm

Sunt-ing · 2026-06-09T16:26:37Z

Purpose

_init_kv_cache_quant rejects --kv-cache-dtype fp8_e5m2 for every quantized checkpoint with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints". But weight-only checkpoints (INT4/INT8 compressed-tensors, AWQ, GPTQ) carry no fp8 KV scales and are not fp8. On Ampere, where fp8_e5m2 is the only fp8 KV cache dtype that runs at all (fp8_e4m3 is unsupported by the hardware), this makes fp8 KV cache unreachable for every weight-quantized model.

The gate now fires only when the checkpoint actually stores fp8 KV scales. A compressed-tensors checkpoint does so only when it declares a kv_cache_scheme; weight-only ones declare none. Genuine fp8 checkpoints (including compressed-tensors fp8-KV) stay rejected, so a class-only whitelist or a blanket compressed-tensors exemption would be wrong.

This deliberately leaves the separate query-quant fp8_e5m2 assert in Attention.forward alone. On Ampere, allowing it there only moves the crash into reshape_and_cache_kernel_flash, since the query-quant path uses e4m3, which Triton does not support on SM 8.x.
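The narrowed gate described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this PR: `QuantConfig`, `check_kv_cache_dtype`, and the class bodies are hypothetical stand-ins that only borrow names from the description, and the real `_init_kv_cache_quant` has a different signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a checkpoint's quantization config."""

    # None models a weight-only checkpoint that declares no kv_cache_scheme.
    kv_cache_scheme: Optional[dict] = None


class BaseKVCacheMethod:
    """Stand-in for the shared KV cache quant method base class."""

    def __init__(self, quant_config: QuantConfig) -> None:
        self.quant_config = quant_config


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Stand-in: used by compressed-tensors checkpoints, weight-only or not."""


class Fp8KVCacheMethod(BaseKVCacheMethod):
    """Stand-in: only built for genuine fp8 checkpoints."""


def check_kv_cache_dtype(kv_cache_dtype: str, quant_method: object) -> None:
    """Raise only when the checkpoint actually stores fp8 KV scales."""
    if kv_cache_dtype != "fp8_e5m2":
        return
    if not isinstance(quant_method, BaseKVCacheMethod):
        return
    if isinstance(quant_method, CompressedTensorsKVCacheMethod):
        # Compressed-tensors stores fp8 KV scales only with a kv_cache_scheme.
        has_fp8_kv_scales = quant_method.quant_config.kv_cache_scheme is not None
    else:
        # Other KV cache methods are treated as fp8 checkpoints in this sketch.
        has_fp8_kv_scales = True
    if has_fp8_kv_scales:
        raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
```

With these stand-ins, a `CompressedTensorsKVCacheMethod` built from `QuantConfig()` passes the check, while the same class with a non-empty `kv_cache_scheme`, or an `Fp8KVCacheMethod`, still raises.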

Related PRs

[Bugfix] Narrow fp8_e5m2 kv-cache gate to only reject actual fp8 checkpoints #39195 (closed): narrowed the gate to BaseKVCacheMethod, which does not help because CompressedTensorsKVCacheMethod is itself a subclass.
Fix fp8_e5m2 KV cache blocked for AWQ/GPTQ models #39255 (open): exempts all compressed-tensors by class and also patches the query-quant assert, which only moves the Ampere crash deeper as above.
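Why a class-only check cannot separate the two cases can be shown with a few lines of plain Python. The classes below are empty, hypothetical stand-ins that reuse the names from this discussion; `rejected_by_class_gate` is an illustrative helper, not a vLLM function.

```python
class BaseKVCacheMethod:
    """Stand-in for the shared base class."""


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    """Stand-in: a subclass, so it matches any check against the base."""


class Fp8KVCacheMethod(BaseKVCacheMethod):
    """Stand-in for a method only built for fp8 checkpoints."""


def rejected_by_class_gate(quant_method: object) -> bool:
    # A gate keyed on the base class matches every subclass as well.
    return isinstance(quant_method, BaseKVCacheMethod)


# A weight-only compressed-tensors checkpoint and a genuine fp8 checkpoint
# are indistinguishable to this check: both are rejected.
print(rejected_by_class_gate(CompressedTensorsKVCacheMethod()))  # True
print(rejected_by_class_gate(Fp8KVCacheMethod()))  # True
```

Exempting `CompressedTensorsKVCacheMethod` by class has the opposite problem: it would also admit compressed-tensors checkpoints that do declare an fp8 `kv_cache_scheme`, which is why the gate has to look at the checkpoint config instead.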

Test Plan

Reproduce with a weight-only W4A16 compressed-tensors checkpoint (no kv_cache_scheme) and fp8_e5m2 KV cache. The gate is a checkpoint-config check, so it fires identically on any CUDA device; the logs below are from an RTX 4090 (SM 8.9), and the fix was first validated on an A800 (SM 8.0), the Ampere case the Purpose describes.

from vllm import LLM, SamplingParams

llm = LLM(
    model="nm-testing/tinyllama-oneshot-w4a16-channel-v2",  # weight-only W4A16, no kv_cache_scheme
    kv_cache_dtype="fp8_e5m2",
    max_model_len=2048,
    gpu_memory_utilization=0.5,
    enforce_eager=True,
)
out = llm.generate("The capital of France is", SamplingParams(temperature=0.0, max_tokens=8))
print(out[0].outputs[0].text)

Equivalent server: vllm serve nm-testing/tinyllama-oneshot-w4a16-channel-v2 --kv-cache-dtype fp8_e5m2
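To pick a suitable checkpoint for this reproduction, one can check whether its config declares a kv_cache_scheme. The sketch below assumes the checkpoint's config.json nests the compressed-tensors settings under a `quantization_config` key with an optional `kv_cache_scheme` entry; that layout, the helper name `declares_kv_cache_scheme`, and both sample configs are assumptions made for illustration, not the contents of the model named above.

```python
import json


def declares_kv_cache_scheme(config: dict) -> bool:
    """Return True if the (assumed) quantization_config has a kv_cache_scheme."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("kv_cache_scheme") is not None


# Made-up config fragments, not copied from any real checkpoint.
weight_only_cfg = json.loads(
    '{"quantization_config": {"quant_method": "compressed-tensors",'
    ' "kv_cache_scheme": null}}'
)
fp8_kv_cfg = json.loads(
    '{"quantization_config": {"quant_method": "compressed-tensors",'
    ' "kv_cache_scheme": {"num_bits": 8, "type": "float"}}}'
)

print(declares_kv_cache_scheme(weight_only_cfg))  # False
print(declares_kv_cache_scheme(fp8_kv_cfg))  # True
```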

Test Result

Same model, command, and environment for both runs; the only difference is the gate in _init_kv_cache_quant.

Before (main): engine core init fails at the gate, matching the report.

Full error

(EngineCore) Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1171, in run_engine_core
    engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
  File "vllm/v1/engine/core.py", line 937, in __init__
    super().__init__(
  File "vllm/v1/engine/core.py", line 123, in __init__
    self.model_executor = executor_class(vllm_config)
  File "vllm/v1/executor/abstract.py", line 109, in __init__
    self._init_executor()
  File "vllm/v1/executor/uniproc_executor.py", line 68, in _init_executor
    self.driver_worker.load_model()
  File "vllm/v1/worker/gpu_worker.py", line 356, in load_model
    self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
  File "vllm/v1/worker/gpu_model_runner.py", line 5103, in load_model
    self.model = model_loader.load_model(
  File "vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
    model = initialize_model(
  File "vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
    model = model_class(vllm_config=vllm_config, prefix=prefix)
  File "vllm/model_executor/models/llama.py", line 515, in __init__
    self.model = self._init_model(
  File "vllm/model_executor/models/llama.py", line 548, in _init_model
    return LlamaModel(vllm_config=vllm_config, prefix=prefix, layer_type=layer_type)
  File "vllm/model_executor/models/llama.py", line 378, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "vllm/model_executor/models/llama.py", line 380, in <lambda>
    lambda prefix: layer_type(vllm_config=vllm_config, prefix=prefix),
  File "vllm/model_executor/models/llama.py", line 288, in __init__
    self.self_attn = attn_layer_type(
  File "vllm/model_executor/models/llama.py", line 212, in __init__
    self.attn = attn_cls(
  File "vllm/model_executor/layers/attention/attention.py", line 415, in __init__
    _init_kv_cache_quant(self, quant_config, prefix)
  File "vllm/model_executor/layers/attention/attention.py", line 168, in _init_kv_cache_quant
    raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

RuntimeError: Engine core initialization failed. See root cause above.

After (this PR): the engine initializes with fp8_e5m2 KV cache active and generates normally.

Fixed run

INFO [cache.py:279] Using fp8_e5m2 data type to store kv cache. ...
INFO [kv_cache_utils.py:2078] GPU KV cache size: 1,007,856 tokens
INFO [kv_cache_utils.py:2079] Maximum concurrency for 2,048 tokens per request: 492.12x

# generated text for "The capital of France is":
' Paris. The capital of France is Paris'

A weight-only compressed-tensors checkpoint is now allowed with fp8_e5m2; one declaring an fp8 kv_cache_scheme stays rejected. ruff check / ruff format pass.
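As a sanity check on the fixed-run log, the two reported figures are consistent with each other if the concurrency is read as KV cache tokens divided by the per-request token limit. That reading is an assumption about how the number is derived, not a statement about vLLM's implementation; the inputs are copied from the log lines above.

```python
# Values copied from the fixed-run log.
kv_cache_tokens = 1_007_856  # "GPU KV cache size: 1,007,856 tokens"
max_model_len = 2_048        # max_model_len passed to LLM(...)

# Assumed relation: how many full-length requests fit in the cache at once.
max_concurrency = kv_cache_tokens / max_model_len
print(f"{max_concurrency:.2f}x")  # 492.12x
```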

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

…ntized checkpoints

_init_kv_cache_quant raised "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for every checkpoint whose attention quant method is a BaseKVCacheMethod, including compressed-tensors weight-only (INT4/INT8/AWQ/GPTQ) checkpoints that carry no fp8 KV scales. On Ampere, where fp8_e5m2 is the only usable fp8 KV cache dtype, this made fp8 KV cache unreachable for any weight-quantized model.

Narrow the gate so it only fires when the checkpoint actually stores fp8 KV scales: compressed-tensors checkpoints only do so when they declare a kv_cache_scheme; the other KV cache methods (Fp8KVCacheMethod, ModelOptKVCacheMethod, QuarkKVCacheMethod) are only built for fp8/nvfp4 checkpoints and keep being rejected.

Fixes vllm-project#39137

Signed-off-by: Ting Sun <suntcrick@gmail.com>

Sunt-ing · 2026-06-17T17:38:43Z

@yewentao256 PTAL, thanks~

yewentao256

Thanks for the work!

Could you add the full reproduce command on main and the full error report to the PR description? And also logs showing that the error is fixed with this PR.

yewentao256 · 2026-06-17T18:56:42Z

    )


+def _checkpoint_has_fp8_kv_scales(quant_method: QuantizeMethodBase) -> bool:


Maybe inline this function, since it is only used once?

Inline the _checkpoint_has_fp8_kv_scales helper into its only caller, _init_kv_cache_quant. The unit test that imported the helper is removed with it; the reproduction in the PR description covers the gate.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

Sunt-ing · 2026-06-17T19:42:31Z

Thanks @yewentao256, addressed both:

Inlined _checkpoint_has_fp8_kv_scales into its only caller _init_kv_cache_quant (latest commit).
Updated the description with the full reproduce command, the verbatim engine-core init error on main, and the logs from the fixed run (fp8_e5m2 KV cache active + a normal generation).

quant_config is typed as the base QuantizationConfig on BaseKVCacheMethod, so accessing kv_cache_scheme tripped mypy even after narrowing quant_method to CompressedTensorsKVCacheMethod. Cast to CompressedTensorsConfig at the access site; behavior is unchanged.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
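The typing issue in this commit can be reproduced with stand-in classes. This is a sketch under assumptions: the classes below are hypothetical stubs that borrow the names from the commit message, and `has_kv_cache_scheme` is an illustrative helper rather than a function in the codebase.

```python
from typing import Optional, cast


class QuantizationConfig:
    """Stand-in base config; it has no kv_cache_scheme attribute."""


class CompressedTensorsConfig(QuantizationConfig):
    """Stand-in config that carries an optional kv_cache_scheme."""

    def __init__(self, kv_cache_scheme: Optional[dict] = None) -> None:
        self.kv_cache_scheme = kv_cache_scheme


class BaseKVCacheMethod:
    def __init__(self, quant_config: QuantizationConfig) -> None:
        # The attribute is annotated with the base type.
        self.quant_config: QuantizationConfig = quant_config


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    pass


def has_kv_cache_scheme(quant_method: BaseKVCacheMethod) -> bool:
    if not isinstance(quant_method, CompressedTensorsKVCacheMethod):
        return False
    # isinstance narrows quant_method, but its quant_config attribute keeps
    # the base type, so a type checker flags a direct .kv_cache_scheme access.
    # cast() only informs the checker; it does nothing at runtime.
    config = cast(CompressedTensorsConfig, quant_method.quant_config)
    return config.kv_cache_scheme is not None
```

Because `cast` is a no-op at runtime, the check behaves exactly as it would without it; only the static type seen by the checker changes.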

Sunt-ing · 2026-06-18T14:34:38Z

Hi @yewentao256, any updates on the review?

yewentao256

LGTM, thanks for the work!

Sunt-ing · 2026-06-18T17:12:38Z

@yewentao256 Hi Wentao, could you please auto-merge it?

…ntized checkpoints (vllm-project#45040) Signed-off-by: Ting Sun <suntcrick@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

…ntized checkpoints (vllm-project#45040) Signed-off-by: Ting Sun <suntcrick@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

Sunt-ing requested review from AndreasKaratzas, LucasWilkinson, MatthewBonanni, mgoin, pavanimajety, robertgshaw2-redhat, yewentao256 and zyongye as code owners June 9, 2026 16:26

mergify Bot added the bug Something isn't working label Jun 9, 2026

yewentao256 reviewed Jun 17, 2026

yewentao256 approved these changes Jun 18, 2026

yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026

Merge branch 'main' into fix/39137-fp8-e5m2-kv-gate

79dd882

yewentao256 merged commit 79ca54d into vllm-project:main Jun 18, 2026
81 checks passed

yewentao256 merged 4 commits into vllm-project:main from Sunt-ing:fix/39137-fp8-e5m2-kv-gate
