[Bugfix][Quantization] Don't reject fp8_e5m2 KV cache for non-fp8 quantized checkpoints #45040

Merged
yewentao256 merged 4 commits into vllm-project:main from Sunt-ing:fix/39137-fp8-e5m2-kv-gate
Jun 18, 2026

@Sunt-ing Sunt-ing commented Jun 9, 2026

Contributor

Purpose

Closes #39137.

_init_kv_cache_quant rejects --kv-cache-dtype fp8_e5m2 for every quantized checkpoint with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints". But weight-only checkpoints (INT4/INT8 compressed-tensors, AWQ, GPTQ) carry no fp8 KV scales and are not fp8. On Ampere, where fp8_e5m2 is the only fp8 KV cache dtype that runs at all (fp8_e4m3 is unsupported by the hardware), this makes fp8 KV cache unreachable for every weight-quantized model.

The gate now fires only when the checkpoint actually stores fp8 KV scales. A compressed-tensors checkpoint does so only when it declares a kv_cache_scheme; weight-only ones declare none. Genuine fp8 checkpoints (including compressed-tensors fp8-KV) stay rejected, so a class-only whitelist or a blanket compressed-tensors exemption would be wrong.
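The narrowed gate can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual diff: every class ending in `Stub` is a hypothetical stand-in for the corresponding vLLM class, and `check_kv_cache_dtype` is an illustrative name for the logic that lives inside `_init_kv_cache_quant`.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for vLLM's classes; only their shape matters here.
@dataclass
class CompressedTensorsConfigStub:
    # None models a weight-only checkpoint that declares no kv_cache_scheme.
    kv_cache_scheme: Optional[dict] = None


class BaseKVCacheMethodStub:
    def __init__(self, quant_config: object) -> None:
        self.quant_config = quant_config


class CompressedTensorsKVCacheMethodStub(BaseKVCacheMethodStub):
    pass


class Fp8KVCacheMethodStub(BaseKVCacheMethodStub):
    pass


def check_kv_cache_dtype(kv_cache_dtype: str, quant_method: object) -> None:
    """Raise only when fp8_e5m2 is requested and the checkpoint stores fp8 KV scales."""
    if kv_cache_dtype != "fp8_e5m2":
        return
    if not isinstance(quant_method, BaseKVCacheMethodStub):
        return  # no KV cache quant method on this layer, nothing to conflict with
    if isinstance(quant_method, CompressedTensorsKVCacheMethodStub):
        # compressed-tensors stores KV scales only when a kv_cache_scheme is declared
        has_fp8_kv_scales = quant_method.quant_config.kv_cache_scheme is not None
    else:
        # the other KV cache methods are treated as genuine fp8 checkpoints
        has_fp8_kv_scales = True
    if has_fp8_kv_scales:
        raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
```

Keying the decision on the declared `kv_cache_scheme` rather than on the class alone is what keeps compressed-tensors fp8-KV checkpoints rejected while letting weight-only ones through.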

This deliberately leaves the separate query-quant fp8_e5m2 assert in Attention.forward alone. On Ampere, allowing it there only moves the crash into reshape_and_cache_kernel_flash, since the query-quant path uses e4m3, which Triton does not support on SM 8.x.

Related PRs

Test Plan

Reproduce with a weight-only W4A16 compressed-tensors checkpoint (no kv_cache_scheme) and fp8_e5m2 KV cache. The gate is a checkpoint-config check, so it fires identically on any CUDA device; the logs below are from an RTX 4090 (SM 8.9), and the fix was first validated on an A800 (SM 8.0), the Ampere case the Purpose describes.

from vllm import LLM, SamplingParams

llm = LLM(
    model="nm-testing/tinyllama-oneshot-w4a16-channel-v2",  # weight-only W4A16, no kv_cache_scheme
    kv_cache_dtype="fp8_e5m2",
    max_model_len=2048,
    gpu_memory_utilization=0.5,
    enforce_eager=True,
)
out = llm.generate("The capital of France is", SamplingParams(temperature=0.0, max_tokens=8))
print(out[0].outputs[0].text)

Equivalent server: vllm serve nm-testing/tinyllama-oneshot-w4a16-channel-v2 --kv-cache-dtype fp8_e5m2
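To confirm that a checkpoint is weight-only before running the reproduce, one option is to inspect its `config.json`. The sketch below assumes the compressed-tensors settings sit under a `quantization_config` key with an optional `kv_cache_scheme` entry; that layout, the helper name `declares_kv_cache_scheme`, and the two sample configs are illustrative assumptions, not copied from any real checkpoint.

```python
import json


def declares_kv_cache_scheme(config: dict) -> bool:
    """Return True if the (assumed) quantization_config declares a kv_cache_scheme."""
    quant = config.get("quantization_config") or {}
    return quant.get("kv_cache_scheme") is not None


# Hand-written sample fragments for illustration only.
weight_only_cfg = json.loads(
    '{"quantization_config": {"quant_method": "compressed-tensors", "kv_cache_scheme": null}}'
)
fp8_kv_cfg = json.loads(
    '{"quantization_config": {"quant_method": "compressed-tensors",'
    ' "kv_cache_scheme": {"num_bits": 8, "type": "float"}}}'
)

print(declares_kv_cache_scheme(weight_only_cfg))  # False
print(declares_kv_cache_scheme(fp8_kv_cfg))       # True
```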

Test Result

Same model, command, and environment for both runs; the only difference is the gate in _init_kv_cache_quant.

Before (main): engine core init fails at the gate, matching the report.

Full error
(EngineCore) Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1171, in run_engine_core
    engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
  File "vllm/v1/engine/core.py", line 937, in __init__
    super().__init__(
  File "vllm/v1/engine/core.py", line 123, in __init__
    self.model_executor = executor_class(vllm_config)
  File "vllm/v1/executor/abstract.py", line 109, in __init__
    self._init_executor()
  File "vllm/v1/executor/uniproc_executor.py", line 68, in _init_executor
    self.driver_worker.load_model()
  File "vllm/v1/worker/gpu_worker.py", line 356, in load_model
    self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
  File "vllm/v1/worker/gpu_model_runner.py", line 5103, in load_model
    self.model = model_loader.load_model(
  File "vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
    model = initialize_model(
  File "vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
    model = model_class(vllm_config=vllm_config, prefix=prefix)
  File "vllm/model_executor/models/llama.py", line 515, in __init__
    self.model = self._init_model(
  File "vllm/model_executor/models/llama.py", line 548, in _init_model
    return LlamaModel(vllm_config=vllm_config, prefix=prefix, layer_type=layer_type)
  File "vllm/model_executor/models/llama.py", line 378, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "vllm/model_executor/models/llama.py", line 380, in <lambda>
    lambda prefix: layer_type(vllm_config=vllm_config, prefix=prefix),
  File "vllm/model_executor/models/llama.py", line 288, in __init__
    self.self_attn = attn_layer_type(
  File "vllm/model_executor/models/llama.py", line 212, in __init__
    self.attn = attn_cls(
  File "vllm/model_executor/layers/attention/attention.py", line 415, in __init__
    _init_kv_cache_quant(self, quant_config, prefix)
  File "vllm/model_executor/layers/attention/attention.py", line 168, in _init_kv_cache_quant
    raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

RuntimeError: Engine core initialization failed. See root cause above.

After (this PR): the engine initializes with fp8_e5m2 KV cache active and generates normally.

Fixed run
INFO [cache.py:279] Using fp8_e5m2 data type to store kv cache. ...
INFO [kv_cache_utils.py:2078] GPU KV cache size: 1,007,856 tokens
INFO [kv_cache_utils.py:2079] Maximum concurrency for 2,048 tokens per request: 492.12x

# generated text for "The capital of France is":
' Paris. The capital of France is Paris'
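As a sanity check on the log above, the reported concurrency is consistent with dividing the KV cache size by the per-request token budget. The snippet assumes that simple ratio is how the two figures relate; the variable names are illustrative.

```python
kv_cache_tokens = 1_007_856  # "GPU KV cache size" from the log
max_model_len = 2_048        # max_model_len used in the reproduce

concurrency = kv_cache_tokens / max_model_len
print(f"{concurrency:.2f}x")  # prints "492.12x"
```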

A weight-only compressed-tensors checkpoint is now allowed with fp8_e5m2; one declaring an fp8 kv_cache_scheme stays rejected. ruff check / ruff format pass.

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

…ntized checkpoints

_init_kv_cache_quant raised "fp8_e5m2 kv-cache is not supported with fp8
checkpoints" for every checkpoint whose attention quant method is a
BaseKVCacheMethod, including compressed-tensors weight-only (INT4/INT8/AWQ/
GPTQ) checkpoints that carry no fp8 KV scales. On Ampere, where fp8_e5m2 is
the only usable fp8 KV cache dtype, this made fp8 KV cache unreachable for
any weight-quantized model.

Narrow the gate so it only fires when the checkpoint actually stores fp8 KV
scales: compressed-tensors checkpoints only do so when they declare a
kv_cache_scheme; the other KV cache methods (Fp8KVCacheMethod,
ModelOptKVCacheMethod, QuarkKVCacheMethod) are only built for fp8/nvfp4
checkpoints and keep being rejected.

Fixes vllm-project#39137

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@Sunt-ing

Contributor Author

@yewentao256 PTAL, thanks~

@yewentao256 yewentao256 left a comment

Member

Thanks for the work!

Could you add the full reproduce command on main and the full error report to the PR description? And also logs showing that the error is fixed with this PR.

)


def _checkpoint_has_fp8_kv_scales(quant_method: QuantizeMethodBase) -> bool:

Member

Maybe inline this function, since it is only used once?

Inline the _checkpoint_has_fp8_kv_scales helper into its only caller,
_init_kv_cache_quant. The unit test that imported the helper is removed
with it; the reproduce in the PR description covers the gate.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@Sunt-ing

Contributor Author

Thanks @yewentao256, addressed both:

  • Inlined _checkpoint_has_fp8_kv_scales into its only caller _init_kv_cache_quant (latest commit).
  • Updated the description with the full reproduce command, the verbatim engine-core init error on main, and the logs from the fixed run (fp8_e5m2 KV cache active + a normal generation).

quant_config is typed as the base QuantizationConfig on BaseKVCacheMethod,
so accessing kv_cache_scheme tripped mypy even after narrowing quant_method
to CompressedTensorsKVCacheMethod. Cast to CompressedTensorsConfig at the
access site; behavior is unchanged.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
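The mypy fix described in this commit follows a common pattern: narrowing the method with `isinstance` does not narrow the type of an attribute declared on the base class, so the attribute is cast at the access site. A minimal sketch, with every `Stub` class a hypothetical stand-in for the vLLM class of similar name and `stores_kv_scales` an illustrative helper:

```python
from typing import Optional, cast


class QuantizationConfigStub:
    """Stand-in for the base config type; it has no kv_cache_scheme attribute."""


class CompressedTensorsConfigStub(QuantizationConfigStub):
    def __init__(self, kv_cache_scheme: Optional[dict] = None) -> None:
        self.kv_cache_scheme = kv_cache_scheme


class BaseKVCacheMethodStub:
    def __init__(self, quant_config: QuantizationConfigStub) -> None:
        # Declared with the base type, so subclass-only attributes are invisible to mypy.
        self.quant_config: QuantizationConfigStub = quant_config


class CompressedTensorsKVCacheMethodStub(BaseKVCacheMethodStub):
    pass


def stores_kv_scales(method: BaseKVCacheMethodStub) -> bool:
    if isinstance(method, CompressedTensorsKVCacheMethodStub):
        # isinstance narrows `method`, not `method.quant_config`; cast for the checker.
        config = cast(CompressedTensorsConfigStub, method.quant_config)
        return config.kv_cache_scheme is not None
    return True
```

`cast` has no runtime effect, which matches the commit's note that behavior is unchanged.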
@Sunt-ing

Contributor Author

Hi @yewentao256, any updates on the review?

@yewentao256 yewentao256 left a comment

Member

LGTM, thanks for the work!

@yewentao256 yewentao256 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 18, 2026
@Sunt-ing

Contributor Author

@yewentao256 Hi Wentao, could you please auto-merge it?

@yewentao256 yewentao256 merged commit 79ca54d into vllm-project:main Jun 18, 2026
81 checks passed
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…ntized checkpoints (vllm-project#45040)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…ntized checkpoints (vllm-project#45040)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…ntized checkpoints (vllm-project#45040)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…ntized checkpoints (vllm-project#45040)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed)

2 participants