[Bugfix][Quantization] Don't reject fp8_e5m2 KV cache for non-fp8 quantized checkpoints #45040
Merged
yewentao256 merged 4 commits on Jun 18, 2026
…ntized checkpoints

_init_kv_cache_quant raised "fp8_e5m2 kv-cache is not supported with fp8 checkpoints" for every checkpoint whose attention quant method is a BaseKVCacheMethod, including compressed-tensors weight-only (INT4/INT8/AWQ/GPTQ) checkpoints that carry no fp8 KV scales. On Ampere, where fp8_e5m2 is the only usable fp8 KV cache dtype, this made fp8 KV cache unreachable for any weight-quantized model.

Narrow the gate so it only fires when the checkpoint actually stores fp8 KV scales: compressed-tensors checkpoints only do so when they declare a kv_cache_scheme; the other KV cache methods (Fp8KVCacheMethod, ModelOptKVCacheMethod, QuarkKVCacheMethod) are only built for fp8/nvfp4 checkpoints and keep being rejected.

Fixes vllm-project#39137

Signed-off-by: Ting Sun <suntcrick@gmail.com>
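The narrowed gate described in this commit message can be sketched roughly as below. This is an illustrative stand-in, not the code merged in the PR: the classes only borrow the names mentioned above, and `QuantConfig` and `rejects_fp8_e5m2` are hypothetical helpers introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins; the real vLLM classes carry far more state.
@dataclass
class QuantConfig:
    kv_cache_scheme: Optional[dict] = None


class BaseKVCacheMethod:
    def __init__(self, quant_config: QuantConfig) -> None:
        self.quant_config = quant_config


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    pass


class Fp8KVCacheMethod(BaseKVCacheMethod):
    pass


def rejects_fp8_e5m2(quant_method: object, kv_cache_dtype: str) -> bool:
    """Return True when the sketched gate would raise for this combination."""
    if kv_cache_dtype != "fp8_e5m2":
        return False
    if not isinstance(quant_method, BaseKVCacheMethod):
        # No KV cache quant method on the attention layer: nothing to reject.
        return False
    if isinstance(quant_method, CompressedTensorsKVCacheMethod):
        # compressed-tensors stores fp8 KV scales only with a kv_cache_scheme.
        return quant_method.quant_config.kv_cache_scheme is not None
    # Other KV cache methods are only built for fp8/nvfp4 checkpoints.
    return True


weight_only = CompressedTensorsKVCacheMethod(QuantConfig())
fp8_kv = CompressedTensorsKVCacheMethod(QuantConfig({"num_bits": 8, "type": "float"}))
native_fp8 = Fp8KVCacheMethod(QuantConfig())

print(rejects_fp8_e5m2(weight_only, "fp8_e5m2"))  # False: weight-only is allowed
print(rejects_fp8_e5m2(fp8_kv, "fp8_e5m2"))       # True: stores fp8 KV scales
print(rejects_fp8_e5m2(native_fp8, "fp8_e5m2"))   # True: genuine fp8 checkpoint
```

The point of checking the scheme rather than the class is that `CompressedTensorsKVCacheMethod` covers both the weight-only and the fp8-KV case, so a class-only test cannot tell them apart.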
Contributor
Author
@yewentao256 PTAL, thanks~
yewentao256
reviewed
Jun 17, 2026
yewentao256
left a comment
Member
Thanks for the work!
Could you add the full reproduce command on main and the full error report to the PR description? And also logs showing that the error is fixed with this PR.
def _checkpoint_has_fp8_kv_scales(quant_method: QuantizeMethodBase) -> bool:
Member
Maybe inline this function, since it is used only once?
Inline the _checkpoint_has_fp8_kv_scales helper into its only caller, _init_kv_cache_quant. The unit test that imported the helper is removed with it; the reproduce in the PR description covers the gate.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Contributor
Author
Thanks @yewentao256, addressed both:
quant_config is typed as the base QuantizationConfig on BaseKVCacheMethod, so accessing kv_cache_scheme tripped mypy even after narrowing quant_method to CompressedTensorsKVCacheMethod. Cast to CompressedTensorsConfig at the access site; behavior is unchanged.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
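The typing issue in this commit message can be illustrated with a small, self-contained sketch. The classes below are simplified stand-ins that reuse the names from the message, and `declares_kv_cache_scheme` is a hypothetical helper, not a function from the PR.

```python
from typing import Optional, cast


class QuantizationConfig:
    """Stand-in base config: it has no kv_cache_scheme attribute."""


class CompressedTensorsConfig(QuantizationConfig):
    def __init__(self, kv_cache_scheme: Optional[dict] = None) -> None:
        self.kv_cache_scheme = kv_cache_scheme


class BaseKVCacheMethod:
    def __init__(self, quant_config: QuantizationConfig) -> None:
        # Declared as the base type, as the commit message describes.
        self.quant_config = quant_config


class CompressedTensorsKVCacheMethod(BaseKVCacheMethod):
    pass


def declares_kv_cache_scheme(method: BaseKVCacheMethod) -> bool:
    if not isinstance(method, CompressedTensorsKVCacheMethod):
        return False
    # isinstance narrows `method` but not the declared type of its
    # `quant_config` attribute, so a static checker still sees the base
    # QuantizationConfig here. cast() changes only the static type.
    config = cast(CompressedTensorsConfig, method.quant_config)
    return config.kv_cache_scheme is not None


print(declares_kv_cache_scheme(
    CompressedTensorsKVCacheMethod(CompressedTensorsConfig())))
print(declares_kv_cache_scheme(
    CompressedTensorsKVCacheMethod(CompressedTensorsConfig({"num_bits": 8}))))
```

`typing.cast` returns its argument unchanged at runtime, which is consistent with the commit's note that behavior is unchanged.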
Contributor
Author
Hi @yewentao256, any updates on the review?
yewentao256
approved these changes
Jun 18, 2026
yewentao256
left a comment
Member
LGTM, thanks for the work!
Contributor
Author
@yewentao256 Hi Wentao, could you please auto-merge it?
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request on Jun 19, 2026
…ntized checkpoints (vllm-project#45040)
Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Jun 21, 2026
…ntized checkpoints (vllm-project#45040)
Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Jun 22, 2026
…ntized checkpoints (vllm-project#45040)
Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request on Jun 24, 2026
…ntized checkpoints (vllm-project#45040)
Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Purpose
Closes #39137.
_init_kv_cache_quant rejects --kv-cache-dtype fp8_e5m2 for every quantized checkpoint with "fp8_e5m2 kv-cache is not supported with fp8 checkpoints". But weight-only checkpoints (INT4/INT8 compressed-tensors, AWQ, GPTQ) carry no fp8 KV scales and are not fp8. On Ampere, where fp8_e5m2 is the only fp8 KV cache dtype that runs at all (fp8_e4m3 is unsupported by the hardware), this makes fp8 KV cache unreachable for every weight-quantized model.

The gate now fires only when the checkpoint actually stores fp8 KV scales. A compressed-tensors checkpoint does so only when it declares a kv_cache_scheme; weight-only ones declare none. Genuine fp8 checkpoints (including compressed-tensors fp8-KV) stay rejected, so a class-only whitelist or a blanket compressed-tensors exemption would be wrong.

This deliberately leaves the separate query-quant fp8_e5m2 assert in Attention.forward alone. On Ampere, allowing it there only moves the crash into reshape_and_cache_kernel_flash, since the query-quant path uses e4m3, which Triton does not support on SM 8.x.

Related PRs
BaseKVCacheMethod, which does not help because CompressedTensorsKVCacheMethod is itself a subclass.

Test Plan
Reproduce with a weight-only W4A16 compressed-tensors checkpoint (no kv_cache_scheme) and fp8_e5m2 KV cache. The gate is a checkpoint-config check, so it fires identically on any CUDA device; the logs below are from an RTX 4090 (SM 8.9), and the fix was first validated on an A800 (SM 8.0), the Ampere case the Purpose describes.

Equivalent server:
vllm serve nm-testing/tinyllama-oneshot-w4a16-channel-v2 --kv-cache-dtype fp8_e5m2

Test Result
Same model, command, and environment for both runs; the only difference is the gate in _init_kv_cache_quant.

Before (main): engine core init fails at the gate, matching the report.
Full error
After (this PR): the engine initializes with fp8_e5m2 KV cache active and generates normally.
Fixed run
A weight-only compressed-tensors checkpoint is now allowed with fp8_e5m2; one declaring an fp8 kv_cache_scheme stays rejected.

ruff check / ruff format pass.

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.
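To tell in advance which of the two cases a given checkpoint falls into, one could inspect its config file. The sketch below assumes the checkpoint's config.json nests a kv_cache_scheme key under quantization_config; that layout, the sample JSON values, and the helper name `config_declares_kv_cache_scheme` are assumptions for illustration and are not taken from the PR.

```python
import json


def config_declares_kv_cache_scheme(config_json: str) -> bool:
    """Check a config.json payload for a non-null kv_cache_scheme.

    Assumes the key sits under "quantization_config"; a missing key and an
    explicit null are both treated as "no scheme declared".
    """
    config = json.loads(config_json)
    quant = config.get("quantization_config") or {}
    return quant.get("kv_cache_scheme") is not None


# Simplified, made-up config payloads for the two cases.
weight_only_cfg = json.dumps(
    {"quantization_config": {"quant_method": "compressed-tensors",
                             "kv_cache_scheme": None}}
)
fp8_kv_cfg = json.dumps(
    {"quantization_config": {"quant_method": "compressed-tensors",
                             "kv_cache_scheme": {"num_bits": 8, "type": "float"}}}
)

print(config_declares_kv_cache_scheme(weight_only_cfg))  # False
print(config_declares_kv_cache_scheme(fp8_kv_cfg))       # True
```

Under these assumptions, a False result corresponds to the weight-only case that is now allowed with fp8_e5m2, and True to the case that stays rejected.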