[v1][kvcache] Honor prefix-cache retention interval for Mamba/linear attention by Dao007forever · Pull Request #45845 · vllm-project/vllm

Dao007forever · 2026-06-16T16:52:35Z

Purpose

Wire VLLM_PREFIX_CACHE_RETENTION_INTERVAL to Mamba/linear-attention KV-cache groups, completing the in-code # TODO: Support Mamba/linear attention left by #43447 (which added the mechanism for sliding-window attention only).

Background. Mamba/KDA prefix caching retains a full recurrent-state snapshot once per block_size-token boundary. At small attention block sizes (e.g. 128) each snapshot spans several base blocks, and dense per-boundary retention saturates the KV pool: at block_size 128 the Mamba snapshots occupy ~80% of the blocks, leaving no uncached headroom, so the allocator is forced to evict live attention prefixes. The prefix-cache hit rate then collapses late in long multi-turn runs, dragging down throughput and tail latency, while larger block sizes (256/512) are unaffected. (This was observed in a test setting that allows a smaller block size.)

Change. MambaManager.reachable_block_mask now sparsifies state-snapshot retention the same way SlidingWindowManager does: keep one cached state per retention_interval-sized segment (plus the latest replay boundary) instead of one per block. A hit resumes from the nearest retained boundary (at most retention_interval tokens coarser), costing negligible extra prefill while freeing the intermediate snapshots for reuse. Also:

MambaManager.cache_blocks now tolerates sparse (unhashed) blocks in the cached range.
_validate_prefix_cache_retention_interval now accepts models with a Mamba group.

Default behavior is unchanged: with the interval unset, Mamba caches densely (every boundary), exactly as before.
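
The retention rule described above can be sketched as a standalone function. This is only an illustration, not the code in MambaManager.reachable_block_mask: the function name, its signature, and the assumption that retention_interval is a multiple of block_size are hypothetical, and the replay-boundary arithmetic is copied from the diff hunk quoted in the review thread.

```python
def sparse_retention_mask(start_block, end_block, block_size,
                          retention_interval=None,
                          num_prompt_tokens=None,
                          alignment_tokens=None):
    """Return one bool per block in [start_block, end_block).

    True means "do not skip this block when caching". Hypothetical sketch:
    assumes retention_interval is a multiple of block_size.
    """
    if retention_interval is None:
        # Interval unset: dense retention, every boundary is kept.
        return [True] * (end_block - start_block)

    # (1) Keep the block whose end boundary closes a retention segment.
    mask = [((i + 1) * block_size) % retention_interval == 0
            for i in range(start_block, end_block)]

    # (2) Replay boundary, using the same arithmetic as the quoted diff.
    if num_prompt_tokens is not None:
        align = alignment_tokens or block_size  # assumption: default to block_size
        latest = (num_prompt_tokens - 1) // align * align
        boundary_block = latest // block_size - 1
        if start_block <= boundary_block < end_block:
            mask[boundary_block - start_block] = True

    return mask


# Eight blocks of 128 tokens, one retained state per 512 tokens,
# plus the replay boundary for a 900-token prompt.
mask = sparse_retention_mask(0, 8, 128, retention_interval=512,
                             num_prompt_tokens=900)
print(mask)
```

With these inputs, blocks 3 and 7 are kept by the interval rule and block 6 by the replay boundary; with the interval unset, all eight blocks are kept.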

Why this is not a duplicate

[Prefix Caching] DeepSeekv4 - Support selective prefix-cache retention for sliding-window KV cache #43447 (merged) added VLLM_PREFIX_CACHE_RETENTION_INTERVAL for sliding-window KV cache only, and explicitly left # TODO: Support Mamba/linear attention. This PR completes that TODO.
[Feature] Context-Aware KV-Cache Retention API (#37003) #38514 is a separate feature (a request-level retention-directive API + priority eviction queue); it does not touch the env-var interval path or reachable_block_mask.
No other open PR wires retention-interval sparsification into Mamba/linear-attention groups.

Test Plan

.venv/bin/python -m pytest tests/v1/core/test_prefix_caching.py \
  -k "retention or reachable" -v

.venv/bin/python -m pytest tests/v1/core/ \
  -k "retention or mamba or prefix_cach" -q

E2E: Kimi-Linear-48B-A3B-Instruct (block_size 512, TP4, multi-turn prefix-on, 50 prompts × 60 turns), comparing default vs VLLM_PREFIX_CACHE_RETENTION_INTERVAL=2048.

Test Result

Unit tests (CPU; this change is pure-Python KV-cache scheduling logic):

test_prefix_caching.py -k "retention or reachable" — including the new test_mamba_reachable_block_mask_sparsifies_retention — 10 passed.
tests/v1/core/ -k "retention or mamba or prefix_cach" — all change-relevant tests pass in a CPU-only env; the remaining e2e/flash-attn tests in this selection require a full CUDA build + GPU and were not run here.

E2E: validated on Kimi-Linear-48B-A3B-Instruct with the config above; with VLLM_PREFIX_CACHE_RETENTION_INTERVAL=2048

AI assistance disclosure: AI assistance (Claude) was used for this change. The human submitter has reviewed every changed line and run the tests above.

🤖 Generated with Claude Code

…attention

Wire VLLM_PREFIX_CACHE_RETENTION_INTERVAL to Mamba groups, completing the existing `# TODO: Support Mamba/linear attention` (only sliding-window attention honored it before).

Mamba/KDA prefix caching retains a full recurrent-state snapshot once per block_size-token boundary. At small attention block sizes (e.g. 128 under decoupled hybrid paging) each snapshot spans several base blocks, and dense per-boundary retention saturates the KV pool (at block_size 128 the Mamba snapshots occupy ~80% of the blocks), leaving no uncached headroom, so the allocator is forced to evict live attention prefixes. The prefix-cache hit rate then collapses late in long multi-turn runs (~85%, down to ~75% under load) with ~18% lower throughput and ~3x worse p99, while larger block sizes (256/512) are unaffected.

MambaManager.reachable_block_mask now sparsifies state-snapshot retention the same way SlidingWindowManager does: keep one cached state per retention_interval-sized segment (plus the latest replay boundary) instead of one per block. A hit resumes from the nearest retained boundary (at most retention_interval tokens coarser), costing negligible extra prefill while freeing the intermediate snapshots for reuse.

Also fixes MambaManager.cache_blocks to tolerate sparse (unhashed) blocks in the cached range, and relaxes _validate_prefix_cache_retention_interval to accept models with a Mamba group.

Validated on Kimi-Linear-48B-A3B-Instruct (decoupled hybrid paging, block_size 128, TP4, multi-turn prefix-on, 50 prompts x 60 turns): with VLLM_PREFIX_CACHE_RETENTION_INTERVAL=512 the prefix-cache hit rate recovers to 98.5% (parity with block_size 512), throughput to ~200K tok/s, with zero failed requests.

Default behavior (interval unset) is unchanged: Mamba caches densely.

Test commands run:

.venv/bin/python -m pytest tests/v1/core/test_prefix_caching.py \
  -k "retention or reachable" -v
# 10 passed

.venv/bin/python -m pytest tests/v1/core/ \
  -k "retention or decoupled or buddy or mamba or prefix_cach" -q
# 136 passed

This is not a duplicate: no open PR wires retention-interval sparsification to Mamba/linear-attention groups (the codebase carried it as a TODO).

AI assistance (Claude) was used for this change; the human submitter has reviewed every changed line.

Signed-off-by: Dao Le <daole@inferact.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>

wzhao18 · 2026-06-16T20:21:59Z

Looks good to me. Thanks!

Just need to update the description for VLLM_PREFIX_CACHE_RETENTION_INTERVAL:

vllm/vllm/envs.py

Line 1055 in 475a6ad

    
           # Applies to sliding-window attention for now but not yet Mamba/linear attention.

to include Mamba.

ivanium

Makes sense to me at a high level. Left one comment on Mamba + spec decoding. Also cc @ZJY0516

QilaiZhang · 2026-06-17T07:16:10Z

Thanks for the PR. I have a small clarification question. My understanding is that Kimi-Linear-48B-A3B-Instruct prefix caching currently only supports align mode, and in align mode it does not save a recurrent-state snapshot at every block_size boundary. I may be missing something here, though. Could you explain how this applies to the Kimi-Linear benchmark case?

Dao007forever · 2026-06-17T16:52:44Z

Hi @QilaiZhang, in align mode it still deposits a snapshot every block_size tokens. Those linger as unreferenced-but-cached blocks and accumulate as context grows. This causes block reuse and a drop in the prefix-cache hit rate. The live footprint stays at ~2 blocks; the cached footprint grows by ~1 per block_size tokens, which is what the retention interval thins.

(I was running in a test setting that works around the uniform block size of HMA.)

QilaiZhang · 2026-06-18T01:20:37Z

@Dao007forever Thanks, that makes sense regarding freed state blocks remaining as unreferenced cached blocks.

One follow-up: in upstream align mode, _mamba_block_aligned_split() seems to align scheduled chunks to a multiple of block_size, but not necessarily force one scheduled chunk per block. If a scheduled chunk spans multiple blocks, intermediate entries are null_blocks and only the chunk-end state is cached.

Was your Kimi-Linear benchmark configured so that aligned prefill chunks are effectively one block_size each, perhaps due to the HMA workaround you mentioned? That would explain the “one snapshot per block_size tokens” behavior.

Dao007forever · 2026-06-18T04:09:54Z

You are right that in prefill we snapshot per chunk end, but in decode we still snapshot every block_size tokens.

QilaiZhang · 2026-06-18T06:05:18Z

Thanks, that makes sense. I was only thinking about the prefill path and missed the decode-side behavior across block boundaries. The distinction between live footprint and unreferenced-but-cached footprint is helpful. Thanks for clarifying.
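
The cached-footprint growth discussed in this thread can be put into rough numbers. The helper below is a hypothetical back-of-the-envelope bound, not vLLM code: it assumes a snapshot lands at every block_size boundary (the decode-path behavior described above; chunked prefill deposits fewer) and that retention_interval is a multiple of block_size.

```python
def retained_snapshot_bound(context_tokens, block_size, retention_interval=None):
    """Upper bound on cached state snapshots left behind by one request.

    Hypothetical helper: counts boundaries only and ignores how many base
    blocks each snapshot occupies.
    """
    dense = context_tokens // block_size
    if retention_interval is None:
        return dense
    # One snapshot per retention segment, plus the latest replay boundary.
    sparse = context_tokens // retention_interval + 1
    return min(dense, sparse)


# 16K tokens of context at block_size 128.
print(retained_snapshot_bound(16384, 128))       # dense retention
print(retained_snapshot_bound(16384, 128, 512))  # one per 512 tokens
```

Under these assumptions the dense case leaves up to 128 snapshots cached and the 512-token interval leaves up to 33, while the live footprint is the same in both cases.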

ivanium

One quick comment

ivanium · 2026-06-22T20:32:43Z

+        # (2) Replay boundary. ``get_computed_blocks`` caps hits at
+        # ``num_prompt - 1``, so an exact prompt replay lands on the latest
+        # fine-aligned boundary. Sparse retention would otherwise skip its
+        # state, so keep it explicitly.
+        if num_prompt_tokens is not None:
+            latest = (num_prompt_tokens - 1) // alignment_tokens * alignment_tokens
+            boundary_block = latest // block_size - 1
+            if start_block <= boundary_block < end_block:
+                mask[boundary_block - start_block] = True
+
+        return mask


Not sure if this part can work, because for Mamba we need scheduler-side changes to cache the end of the prompt. Maybe we can raise NotImplementedError when num_prompt_tokens is given?

Good catch to raise it! I think we're safe here though — the mask is purely subtractive. A True never forces a block to be cached, it just declines to skip it; the real gate is blk.is_null in cache_full_blocks:

if blk.is_null or (block_mask is not None and not block_mask[i]): continue

So the layers stay separate: the scheduler decides where a snapshot lands (which blocks are non-null), and the mask just picks which existing ones to keep. If there's no state at the boundary, it's a null_block and gets skipped regardless of the mask — so we can never fabricate a cache entry over a stateless block. Worst case without scheduler changes is that the branch is inert, not incorrect (and in decode we snapshot every block_size, so it usually does fire).
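
The "purely subtractive" argument can be checked with a small model of the gate. The Block class and blocks_to_cache function below are hypothetical stand-ins, not the vLLM types; only the skip condition is taken from the line quoted above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV-cache block."""
    block_id: int
    is_null: bool = False  # True when the scheduler stored no state here


def blocks_to_cache(blocks, block_mask=None):
    """Return the ids of blocks that would be cached."""
    kept = []
    for i, blk in enumerate(blocks):
        # Same condition as the quoted line: a null block is skipped
        # no matter what the mask says.
        if blk.is_null or (block_mask is not None and not block_mask[i]):
            continue
        kept.append(blk.block_id)
    return kept


blocks = [Block(0), Block(1, is_null=True), Block(2)]
print(blocks_to_cache(blocks, [True, True, True]))   # mask cannot revive block 1
print(blocks_to_cache(blocks, [True, True, False]))  # mask can only drop blocks
```

In this model, setting a mask entry to True over a null block has no effect, which matches the claim that the replay-boundary branch is inert rather than incorrect when no state exists at that boundary.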

Makes sense!

ivanium · 2026-06-22T23:03:34Z

+        # (2) Replay boundary. ``get_computed_blocks`` caps hits at
+        # ``num_prompt - 1``, so an exact prompt replay lands on the latest
+        # fine-aligned boundary. Sparse retention would otherwise skip its
+        # state, so keep it explicitly.
+        if num_prompt_tokens is not None:
+            latest = (num_prompt_tokens - 1) // alignment_tokens * alignment_tokens
+            boundary_block = latest // block_size - 1
+            if start_block <= boundary_block < end_block:
+                mask[boundary_block - start_block] = True
+
+        return mask


Makes sense!

ivanium

LGTM!

…attention (vllm-project#45845) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…attention (vllm-project#45845) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

Dao007forever requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners June 16, 2026 16:52

mergify Bot added the v1 label Jun 16, 2026

Dao007forever force-pushed the mamba-retention-interval branch from 2065198 to e5df62d June 16, 2026 17:34

Dao007forever and others added 2 commits June 16, 2026 10:48

Simplify

4268109

Signed-off-by: Dao Le <Dao007forever@gmail.com>

Merge branch 'main' into mamba-retention-interval

c247575

WoosukKwon added the ready label Jun 16, 2026

ivanium reviewed Jun 16, 2026

Comment thread vllm/v1/core/single_type_kv_cache_manager.py

Merge branch 'main' into mamba-retention-interval

1dd93f5

ivanium reviewed Jun 22, 2026

ivanium approved these changes Jun 22, 2026

ivanium approved these changes Jun 23, 2026

WoosukKwon merged commit 430a95a into vllm-project:main Jun 23, 2026
54 of 56 checks passed

underfituu mentioned this pull request Jul 1, 2026

[RFC]: Improve Prefix-Caching Hit Rate for Hybrid Models vllm-project/vllm-ascend#10517

Open

14 tasks

WoosukKwon merged 4 commits into vllm-project:main from Dao007forever:mamba-retention-interval
